Sybil
Architecture

System Architecture

How Sybil transforms raw documents into structured Business Requirements Documents — from infrastructure to AI agent orchestration to data preprocessing.

01

Infrastructure

Cloud-native architecture on Google Cloud Platform with Vercel edge delivery and Terraform-managed infrastructure.

Client Tier
Next.js 14Vercel CDN + Edge
JWT AuthHS256 + bcrypt
FastAPICloud Run
Services Tier
FirestoreNoSQL Database
Cloud StorageDocument Files
Gemini 2.5 ProAI Engine
Artifact RegistryDocker Images
Infrastructure as Code
TerraformCloud Run + Registry
deploy.shBuild + Push + Apply
Next.js 14TypeScriptFastAPIPython 3.11FirestoreCloud StorageGemini 2.5 ProTerraformCloud RunVercelJWTshadcn/uiTailwind CSSZustandFramer Motion
02

AI Agent Flow

Fully agentic BRD generation and natural language editing — the AI plans its own workflow, reads documents, and writes structured requirements.

BRD Generation

fully agentic — max 30 iterations
1

User clicks “Generate BRD”

POST /brds/generate — fire-and-forget background task, returns 202 Accepted

2

AI discovers documents

Calls list_project_documents() — gets doc list with AI metadata (summaries, tags, topics)

real tool
3

AI reads full document text

Calls get_full_document_text() — reads entire document, not chunks. No RAG = no context loss.

real tool
4

AI cross-references documents

Calls search_documents_by_topic() and search_documents_by_content()

real tool
5

AI writes 13 BRD sections

Calls submit_brd_section() ×13 — virtual tool (intercepted, not executed). Each section includes markdown content + citations.

executive_summarybusiness_objectivesstakeholdersfunctional_requirementsnon_functional_requirementsassumptionssuccess_metricstimelineproject_backgroundproject_scopedependenciesriskscost_benefit
virtual tool
6

AI submits conflict + sentiment analysis

Calls submit_analysis() — detects contradictions across documents, extracts stakeholder sentiment, lists key concerns

virtual tool
7

Backend assembles BRD

Collects all virtual tool outputs → builds BRD model → stores in Firestore. Frontend polls and detects new BRD.

Why Virtual Tools?

Gemini API cannot combine tools (function calling) with response_mime_type: "application/json" (structured output) in the same request. Our workaround: define virtual tools that the AI “calls” — but the backend intercepts the function call arguments as structured data instead of executing them. This gives us both agentic tool-calling AND structured output in one pipeline.

Natural Language Editing

unified chat — max 8 iterations

User Interaction

Select text in BRD viewer
“Make this more concise”
Accept & Replace — or Iterate

Backend Processing

3-Layer Security

Pydantic → Regex injection scan → Defensive prompt wrapping

AI Self-Classification

AI calls submit_response(content, type)

refinementanswergeneration
Frontend Response

refinement/generation → “Accept & Replace” bar | answer → plain chat message

03

Data Preprocessing

Multi-tier NLP funnel that filters 517K Enron emails down to curated project datasets using heuristic scoring and semantic embeddings.

TIER 0Parse & Stream517,401 emails

enron_loader.py

Stream CSV in 5000-row batches

Parse RFC 822 headers + body

Deduplication

Key: subject + sender + date

Normalize Re:/FW: prefixes

Parallel

multiprocessing.Pool

CPU-bound parsing distributed

TIER 1Heuristic Filter517K → 78K (15%)

Positive Signals

+0.30BRD keywords in body
+0.20BRD keywords in subject
+0.15Targeted (1-10 recipients)
+0.15Substantial body (50-500 words)
+0.10Action language (?, "please review")
+0.10Good folder (inbox, sent)

Negative Signals

-0.30Noise keywords (lunch, birthday)
-0.20Noise subject patterns (FW: FW:)
-0.20Mass email (20+ recipients)
-0.15Noise folder (deleted_items, spam)
-0.10Trivially short (<15 words)
47 BRD keywords50+ generic subjects12 newsletter patterns9 junk folders
TIER 2Embedding Filter78K → 2K (top 2.5%)

Gemini text-embedding-004

768-dim vectors for all emails + 10 seed queries

Cosine Similarity

Each email scored against all BRD seed queries

Combined score: 0.3 × heuristic + 0.7 × embedding

Cost: ~$0.10 for 50K emails • Speed: ~5 min with batching + concurrency

TIER 3Export & Upload→ Sybil project

Export filtered emails as .txt files → Upload to Sybil API in batches (5 files/batch, 2s delay) → Chomper parses → Gemini classifies → Ready for BRD generation

Auto-Discovery

eda_discover.py — finds project threads automatically

Thread Scoring Formula

score = email_count × capped_senders × log2(avg_words)
        × project_indicator_bonus (10x)
        × brd_signal_density_bonus (1-4x)
        × blast_email_penalty (0.5x)

Output per Project

Rank, name, discovery score, email count, unique senders, extracted keywords (TF, no IDF), 5 auto-generated seed queries for embedding filter

ML Techniques Used

TechniqueLocationPurpose
Gemini text-embedding-004embedding_filter.pySemantic email representation (768-dim vectors)
Cosine similarityembedding_filter.pyRelevance scoring vs BRD seed queries
Term frequency (TF)eda_discover.pyKeyword extraction (no IDF, stopwords removed)
Weighted feature scoringheuristic_filter.pyRule-based relevance classification
Hybrid rankingcurate_project.py0.3 heuristic + 0.7 embedding combination

No custom models were trained. The pipeline uses pre-trained Gemini embeddings combined with hand-crafted heuristic features — standard NLP practice when labeled training data is unavailable.

04

Backend Architecture

Clean layered architecture — routes handle HTTP, services own business logic, models enforce schemas. Fully async with fire-and-forget background processing.

HTTP Layer
/authToken
/projectsCRUD
/documentsUpload
/brdsGenerate
/deletionsAsync
Service Layer — Business Logic
DocumentPipeline orchestration
BRD GenerationAgentic loop
Text RefinementUnified chat
FirestoreDatabase CRUD
StorageCloud Storage ops
GeminiAI API wrapper
AuthJWT + users
AI ServiceUtility functions
AgentTool executor
Data & External APIs
Pydantic ModelsType-safe schemas
FirestoreAsync client
Cloud StorageGCS bucket
Gemini APIgoogle-genai SDK

Async-First Design

Every I/O operation is async — Firestore AsyncClient, Cloud Storage uploads, Gemini API calls. Heavy processing (document parsing, BRD generation) runs as BackgroundTasks — the API returns 202 Accepted immediately and the frontend polls for completion. Zero blocking in request handlers.

API Endpoints

RESTful — /projects/{id}/resource pattern
MethodEndpointPurposePattern
POST/auth/tokenAuthenticate usersync
POST/projectsCreate projectsync
GET/projects/{id}Get project detailssync
POST/projects/{id}/documents/uploadUpload documentbackground
GET/projects/{id}/documentsList documentssync
POST/projects/{id}/brds/generateGenerate BRDbackground
GET/projects/{id}/brdsList BRDssync
POST/projects/{id}/brds/{brd_id}/chatChat / Refine BRDsync
POST/deletions/previewPreview cascade deletesync
POST/deletions/confirmExecute deletionbackground
05

Document Pipeline

5-stage async pipeline — upload, store, parse, analyze in parallel, finalize. Transforms raw files into AI-enriched, searchable document records.

STAGE 1Create Recordreturns 202 immediately

ID Generation

Generate unique doc_id

Prefix-based (doc_*)

Firestore Write

Create document record

status: “uploading”

BackgroundTask

Remaining stages run async

Client polls GET for status

STAGE 2Cloud Storage UploadGCS bucket

File Storage

Write raw bytes to Cloud Storage

Path: projects/{id}/documents/{filename}

Status Update

Firestore status → “processing”

Frontend shows spinner

STAGE 3Chomper Parse36+ formats supported

Text Extraction

PDF, DOCX, PPTX, XLSX, CSV, HTML, TXT

Full document text

Chunking

Word-based splitting

1000 words, 100 overlap

Storage

Save to Cloud Storage

text_path + chunk_path

STAGE 4Parallel AI Analysisasyncio.gather()
Task A — Classification

Gemini classifies document type:

requirements_docmeeting_notesemail_threadtechnical_specproposalreport
Task B — Metadata Generation

summary — concise document overview

tags — content classification labels

topic_relevance — {topic: 0.0–1.0} scores

content_indicators — has_requirements, has_decisions, etc.

key_entities — stakeholders, features, decisions

STAGE 5Finalizestatus → complete

Store all metadata in Firestore → Update document_count on project → Set status to “complete” (or “failed” with error message) → Frontend detects change on next poll

Why This Pipeline Matters

The AI metadata generated in Stage 4 is what makes the BRD agent intelligent. When the agent calls list_project_documents(), it sees topic relevance scores and content indicators — not just filenames. When it calls search_documents_by_topic(), the pre-computed topic scores enable instant filtering without re-reading documents. The pipeline transforms dumb files into searchable, classified, AI-enriched records.

AI Metadata Structure

per document — domain-agnostic

Topic Relevance

infrastructure
0.9
security
0.7
compliance
0.4
budget
0.2

AI generates topics from content — not predefined

Content Indicators

has requirements
has decisions
has timelines
has budget info
has stakeholder input

Boolean flags for fast document selection

Key Entities

Stakeholders

Product TeamEngineeringLegal

Features

SSOAudit Logging

Decisions

Use AWSQ3 Launch