Architecture

System Architecture

How Sybil transforms raw documents into structured Business Requirements Documents — from infrastructure to AI agent orchestration to data preprocessing.

01 Infrastructure 02 AI Agent Flow 03 Data Preprocessing 04 Backend Architecture 05 Document Pipeline

Infrastructure

Cloud-native architecture on Google Cloud Platform with Vercel edge delivery and Terraform-managed infrastructure.

Client Tier

Next.js 14Vercel CDN + Edge

JWT AuthHS256 + bcrypt

FastAPICloud Run

Services Tier

FirestoreNoSQL Database

Cloud StorageDocument Files

Gemini 2.5 ProAI Engine

Artifact RegistryDocker Images

Infrastructure as Code

TerraformCloud Run + Registry

deploy.shBuild + Push + Apply

Next.js 14TypeScriptFastAPIPython 3.11FirestoreCloud StorageGemini 2.5 ProTerraformCloud RunVercelJWTshadcn/uiTailwind CSSZustandFramer Motion

AI Agent Flow

Fully agentic BRD generation and natural language editing — the AI plans its own workflow, reads documents, and writes structured requirements.

BRD Generation

fully agentic — max 30 iterations

User clicks “Generate BRD”

POST /brds/generate — fire-and-forget background task, returns 202 Accepted

AI discovers documents

Calls list_project_documents() — gets doc list with AI metadata (summaries, tags, topics)

real tool

AI reads full document text

Calls get_full_document_text() — reads entire document, not chunks. No RAG = no context loss.

real tool

AI cross-references documents

Calls search_documents_by_topic() and search_documents_by_content()

real tool

AI writes 13 BRD sections

Calls submit_brd_section() ×13 — virtual tool (intercepted, not executed). Each section includes markdown content + citations.

executive_summarybusiness_objectivesstakeholdersfunctional_requirementsnon_functional_requirementsassumptionssuccess_metricstimelineproject_backgroundproject_scopedependenciesriskscost_benefit

virtual tool

AI submits conflict + sentiment analysis

Calls submit_analysis() — detects contradictions across documents, extracts stakeholder sentiment, lists key concerns

virtual tool

Backend assembles BRD

Collects all virtual tool outputs → builds BRD model → stores in Firestore. Frontend polls and detects new BRD.

Why Virtual Tools?

Gemini API cannot combine tools (function calling) with response_mime_type: "application/json" (structured output) in the same request. Our workaround: define virtual tools that the AI “calls” — but the backend intercepts the function call arguments as structured data instead of executing them. This gives us both agentic tool-calling AND structured output in one pipeline.

Natural Language Editing

unified chat — max 8 iterations

User Interaction

Select text in BRD viewer

“Make this more concise”

Accept & Replace — or Iterate

Backend Processing

3-Layer Security

Pydantic → Regex injection scan → Defensive prompt wrapping

AI Self-Classification

AI calls submit_response(content, type)

refinementanswergeneration

Frontend Response

refinement/generation → “Accept & Replace” bar | answer → plain chat message

Data Preprocessing

Multi-tier NLP funnel that filters 517K Enron emails down to curated project datasets using heuristic scoring and semantic embeddings.

TIER 0Parse & Stream517,401 emails

enron_loader.py

Stream CSV in 5000-row batches

Parse RFC 822 headers + body

Deduplication

Key: subject + sender + date

Normalize Re:/FW: prefixes

Parallel

multiprocessing.Pool

CPU-bound parsing distributed

TIER 1Heuristic Filter517K → 78K (15%)

Positive Signals

+0.30BRD keywords in body

+0.20BRD keywords in subject

+0.15Targeted (1-10 recipients)

+0.15Substantial body (50-500 words)

+0.10Action language (?, "please review")

+0.10Good folder (inbox, sent)

Negative Signals

-0.30Noise keywords (lunch, birthday)

-0.20Noise subject patterns (FW: FW:)

-0.20Mass email (20+ recipients)

-0.15Noise folder (deleted_items, spam)

-0.10Trivially short (<15 words)

47 BRD keywords50+ generic subjects12 newsletter patterns9 junk folders

TIER 2Embedding Filter78K → 2K (top 2.5%)

Gemini text-embedding-004

768-dim vectors for all emails + 10 seed queries

Cosine Similarity

Each email scored against all BRD seed queries

Combined score: 0.3 × heuristic + 0.7 × embedding

Cost: ~$0.10 for 50K emails • Speed: ~5 min with batching + concurrency

TIER 3Export & Upload→ Sybil project

Export filtered emails as .txt files → Upload to Sybil API in batches (5 files/batch, 2s delay) → Chomper parses → Gemini classifies → Ready for BRD generation

Auto-Discovery

eda_discover.py — finds project threads automatically

Thread Scoring Formula

score = email_count × capped_senders × log2(avg_words)
        × project_indicator_bonus (10x)
        × brd_signal_density_bonus (1-4x)
        × blast_email_penalty (0.5x)

Output per Project

Rank, name, discovery score, email count, unique senders, extracted keywords (TF, no IDF), 5 auto-generated seed queries for embedding filter

ML Techniques Used

Technique	Location	Purpose
Gemini text-embedding-004	embedding_filter.py	Semantic email representation (768-dim vectors)
Cosine similarity	embedding_filter.py	Relevance scoring vs BRD seed queries
Term frequency (TF)	eda_discover.py	Keyword extraction (no IDF, stopwords removed)
Weighted feature scoring	heuristic_filter.py	Rule-based relevance classification
Hybrid ranking	curate_project.py	0.3 heuristic + 0.7 embedding combination

No custom models were trained. The pipeline uses pre-trained Gemini embeddings combined with hand-crafted heuristic features — standard NLP practice when labeled training data is unavailable.

Backend Architecture

Clean layered architecture — routes handle HTTP, services own business logic, models enforce schemas. Fully async with fire-and-forget background processing.

HTTP Layer

/authToken

/projectsCRUD

/documentsUpload

/brdsGenerate

/deletionsAsync

Service Layer — Business Logic

DocumentPipeline orchestration

BRD GenerationAgentic loop

Text RefinementUnified chat

FirestoreDatabase CRUD

StorageCloud Storage ops

GeminiAI API wrapper

AuthJWT + users

AI ServiceUtility functions

AgentTool executor

Data & External APIs

Pydantic ModelsType-safe schemas

FirestoreAsync client

Cloud StorageGCS bucket

Gemini APIgoogle-genai SDK

Async-First Design

Every I/O operation is async — Firestore AsyncClient, Cloud Storage uploads, Gemini API calls. Heavy processing (document parsing, BRD generation) runs as BackgroundTasks — the API returns 202 Accepted immediately and the frontend polls for completion. Zero blocking in request handlers.

API Endpoints

RESTful — /projects/{id}/resource pattern

Method	Endpoint	Purpose	Pattern
POST	/auth/token	Authenticate user	sync
POST	/projects	Create project	sync
GET	/projects/{id}	Get project details	sync
POST	/projects/{id}/documents/upload	Upload document	background
GET	/projects/{id}/documents	List documents	sync
POST	/projects/{id}/brds/generate	Generate BRD	background
GET	/projects/{id}/brds	List BRDs	sync
POST	/projects/{id}/brds/{brd_id}/chat	Chat / Refine BRD	sync
POST	/deletions/preview	Preview cascade delete	sync
POST	/deletions/confirm	Execute deletion	background

Document Pipeline

5-stage async pipeline — upload, store, parse, analyze in parallel, finalize. Transforms raw files into AI-enriched, searchable document records.

STAGE 1Create Recordreturns 202 immediately

ID Generation

Generate unique doc_id

Prefix-based (doc_*)

Firestore Write

Create document record

status: “uploading”

BackgroundTask

Remaining stages run async

Client polls GET for status

STAGE 2Cloud Storage UploadGCS bucket

File Storage

Write raw bytes to Cloud Storage

Path: projects/{id}/documents/{filename}

Status Update

Firestore status → “processing”

Frontend shows spinner

STAGE 3Chomper Parse36+ formats supported

Text Extraction

PDF, DOCX, PPTX, XLSX, CSV, HTML, TXT

Full document text

Chunking

Word-based splitting

1000 words, 100 overlap

Storage

Save to Cloud Storage

text_path + chunk_path

STAGE 4Parallel AI Analysisasyncio.gather()

Task A — Classification

Gemini classifies document type:

requirements_docmeeting_notesemail_threadtechnical_specproposalreport

Task B — Metadata Generation

summary — concise document overview

tags — content classification labels

topic_relevance — {topic: 0.0–1.0} scores

content_indicators — has_requirements, has_decisions, etc.

key_entities — stakeholders, features, decisions

STAGE 5Finalizestatus → complete

Store all metadata in Firestore → Update document_count on project → Set status to “complete” (or “failed” with error message) → Frontend detects change on next poll

Why This Pipeline Matters

The AI metadata generated in Stage 4 is what makes the BRD agent intelligent. When the agent calls list_project_documents(), it sees topic relevance scores and content indicators — not just filenames. When it calls search_documents_by_topic(), the pre-computed topic scores enable instant filtering without re-reading documents. The pipeline transforms dumb files into searchable, classified, AI-enriched records.

AI Metadata Structure

per document — domain-agnostic

Topic Relevance

infrastructure

0.9

security

0.7

compliance

0.4

budget

0.2

AI generates topics from content — not predefined

Content Indicators

■has requirements

■has decisions

□has timelines

□has budget info

■has stakeholder input

Boolean flags for fast document selection

Key Entities

Stakeholders

Product TeamEngineeringLegal

Features

SSOAudit Logging

Decisions

Use AWSQ3 Launch