How Sybil transforms raw documents into structured Business Requirements Documents — from infrastructure to AI agent orchestration to data preprocessing.
Cloud-native architecture on Google Cloud Platform with Vercel edge delivery and Terraform-managed infrastructure.
Fully agentic BRD generation and natural language editing — the AI plans its own workflow, reads documents, and writes structured requirements.
User clicks “Generate BRD”
POST /brds/generate — fire-and-forget background task, returns 202 Accepted
AI discovers documents
Calls list_project_documents() — gets doc list with AI metadata (summaries, tags, topics)
AI reads full document text
Calls get_full_document_text() — reads entire document, not chunks. No RAG = no context loss.
AI cross-references documents
Calls search_documents_by_topic() and search_documents_by_content()
AI writes 13 BRD sections
Calls submit_brd_section() ×13 — virtual tool (intercepted, not executed). Each section includes markdown content + citations.
AI submits conflict + sentiment analysis
Calls submit_analysis() — detects contradictions across documents, extracts stakeholder sentiment, lists key concerns
Backend assembles BRD
Collects all virtual tool outputs → builds BRD model → stores in Firestore. Frontend polls and detects new BRD.
Gemini API cannot combine tools (function calling) with response_mime_type: "application/json" (structured output) in the same request. Our workaround: define virtual tools that the AI “calls” — but the backend intercepts the function call arguments as structured data instead of executing them. This gives us both agentic tool-calling AND structured output in one pipeline.
User Interaction
Backend Processing
Pydantic → Regex injection scan → Defensive prompt wrapping
AI calls submit_response(content, type)
refinement/generation → “Accept & Replace” bar | answer → plain chat message
Multi-tier NLP funnel that filters 517K Enron emails down to curated project datasets using heuristic scoring and semantic embeddings.
enron_loader.py
Stream CSV in 5000-row batches
Parse RFC 822 headers + body
Deduplication
Key: subject + sender + date
Normalize Re:/FW: prefixes
Parallel
multiprocessing.Pool
CPU-bound parsing distributed
Positive Signals
Negative Signals
Gemini text-embedding-004
768-dim vectors for all emails + 10 seed queries
Cosine Similarity
Each email scored against all BRD seed queries
Combined score: 0.3 × heuristic + 0.7 × embedding
Cost: ~$0.10 for 50K emails • Speed: ~5 min with batching + concurrency
Export filtered emails as .txt files → Upload to Sybil API in batches (5 files/batch, 2s delay) → Chomper parses → Gemini classifies → Ready for BRD generation
Thread Scoring Formula
score = email_count × capped_senders × log2(avg_words)
× project_indicator_bonus (10x)
× brd_signal_density_bonus (1-4x)
× blast_email_penalty (0.5x)Output per Project
Rank, name, discovery score, email count, unique senders, extracted keywords (TF, no IDF), 5 auto-generated seed queries for embedding filter
| Technique | Location | Purpose |
|---|---|---|
| Gemini text-embedding-004 | embedding_filter.py | Semantic email representation (768-dim vectors) |
| Cosine similarity | embedding_filter.py | Relevance scoring vs BRD seed queries |
| Term frequency (TF) | eda_discover.py | Keyword extraction (no IDF, stopwords removed) |
| Weighted feature scoring | heuristic_filter.py | Rule-based relevance classification |
| Hybrid ranking | curate_project.py | 0.3 heuristic + 0.7 embedding combination |
No custom models were trained. The pipeline uses pre-trained Gemini embeddings combined with hand-crafted heuristic features — standard NLP practice when labeled training data is unavailable.
Clean layered architecture — routes handle HTTP, services own business logic, models enforce schemas. Fully async with fire-and-forget background processing.
Every I/O operation is async — Firestore AsyncClient, Cloud Storage uploads, Gemini API calls. Heavy processing (document parsing, BRD generation) runs as BackgroundTasks — the API returns 202 Accepted immediately and the frontend polls for completion. Zero blocking in request handlers.
| Method | Endpoint | Purpose | Pattern |
|---|---|---|---|
| POST | /auth/token | Authenticate user | sync |
| POST | /projects | Create project | sync |
| GET | /projects/{id} | Get project details | sync |
| POST | /projects/{id}/documents/upload | Upload document | background |
| GET | /projects/{id}/documents | List documents | sync |
| POST | /projects/{id}/brds/generate | Generate BRD | background |
| GET | /projects/{id}/brds | List BRDs | sync |
| POST | /projects/{id}/brds/{brd_id}/chat | Chat / Refine BRD | sync |
| POST | /deletions/preview | Preview cascade delete | sync |
| POST | /deletions/confirm | Execute deletion | background |
5-stage async pipeline — upload, store, parse, analyze in parallel, finalize. Transforms raw files into AI-enriched, searchable document records.
ID Generation
Generate unique doc_id
Prefix-based (doc_*)
Firestore Write
Create document record
status: “uploading”
BackgroundTask
Remaining stages run async
Client polls GET for status
File Storage
Write raw bytes to Cloud Storage
Path: projects/{id}/documents/{filename}
Status Update
Firestore status → “processing”
Frontend shows spinner
Text Extraction
PDF, DOCX, PPTX, XLSX, CSV, HTML, TXT
Full document text
Chunking
Word-based splitting
1000 words, 100 overlap
Storage
Save to Cloud Storage
text_path + chunk_path
Gemini classifies document type:
summary — concise document overview
tags — content classification labels
topic_relevance — {topic: 0.0–1.0} scores
content_indicators — has_requirements, has_decisions, etc.
key_entities — stakeholders, features, decisions
Store all metadata in Firestore → Update document_count on project → Set status to “complete” (or “failed” with error message) → Frontend detects change on next poll
The AI metadata generated in Stage 4 is what makes the BRD agent intelligent. When the agent calls list_project_documents(), it sees topic relevance scores and content indicators — not just filenames. When it calls search_documents_by_topic(), the pre-computed topic scores enable instant filtering without re-reading documents. The pipeline transforms dumb files into searchable, classified, AI-enriched records.
Topic Relevance
AI generates topics from content — not predefined
Content Indicators
Boolean flags for fast document selection
Key Entities
Stakeholders
Features
Decisions