Enterprise RAG Pipeline for Document Intelligence
The problem
A financial services firm had 15 years of regulatory filings, internal policy documents, and legal agreements — roughly 4 million documents — stored in a document management system that was effectively unsearchable. Analysts spent 20–40% of their time locating precedents and cross-referencing policies.
The ask: build a system that lets analysts ask questions in natural language and get accurate, cited answers across the full corpus.
Constraints that shaped the architecture
Before writing a single line of code, the constraints told most of the story:
- Data residency: all processing had to stay within Azure US regions (regulatory requirement)
- No data in OpenAI training: required Azure OpenAI, not the public API
- Latency: analysts expected search-like response times, not research-assistant times. Firm target: p95 < 6 seconds
- Citation required: every answer had to cite source documents with enough metadata for an analyst to verify independently
- Audit trail: query + retrieved context + generation had to be logged for compliance review
These constraints eliminated several otherwise attractive design choices (e.g., using an external hosted vector database, using Claude via Anthropic’s API directly, streaming everything through a serverless function).
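To make the audit-trail constraint concrete, here is a minimal sketch of one log entry per query. The `AuditRecord` class, its field names, and the SHA-256 content hash are illustrative assumptions, not the schema the project used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """Hypothetical compliance log entry: query + retrieved context + generation."""

    query: str
    retrieved_chunk_ids: list
    answer: str
    cited_document_ids: list
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        payload = asdict(self)
        # Hash the canonical (sorted-key) body so a reviewer can detect
        # later modification of the stored entry.
        body = json.dumps(payload, sort_keys=True)
        payload["sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return json.dumps(payload, sort_keys=True)
```

Each serialized record could then be written to whatever append-only sink the compliance team reviews.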
Architecture overview
                 ┌─────────────┐
Documents ─────▶ │  Ingestion  ├──────────┐
                 │  Pipeline   │          │
                 └─────────────┘          ▼
                                   ┌──────────────┐
                                   │    Neo4j     │ ◀── Entity graph
                                   │  (GraphRAG)  │     relationships
                                   └──────┬───────┘
                                          │
User Query ──▶ Query Layer ─────────────▶ │
               (FastAPI)           ┌──────┴───────┐
                   │               │    Qdrant    │ ◀── Dense vectors
                   │               │  (Vectors)   │
                   │               └──────────────┘
                   │
                   ▼
            ┌──────────────┐
            │  Re-ranker   │ (cross-encoder)
            └──────┬───────┘
                   │
                   ▼
            ┌──────────────┐
            │ Azure OpenAI │ (GPT-4o)
            │  Generation  │
            └──────────────┘
Key components:
| Component | Technology | Role |
|---|---|---|
| Ingestion orchestration | Prefect | Scheduled + event-driven pipeline runs |
| Document parsing | Azure Document Intelligence | Layout-aware extraction from PDFs |
| Chunking | Custom semantic chunker | Respects document structure |
| Embeddings | Azure OpenAI text-embedding-3-large | 3072-dim dense vectors |
| Vector store | Qdrant (Azure VM) | ANN search with metadata filtering |
| Graph store | Neo4j (Azure VM) | Entity and relationship traversal |
| Re-ranker | BGE-Reranker-Large | Cross-encoder precision layer |
| Generation | Azure OpenAI GPT-4o | Grounded generation with citations |
| API layer | FastAPI + async | Query orchestration |
| Observability | Azure Monitor + custom logging | Compliance audit trail |
Retrieval design: why we used GraphRAG
The corpus had high entity density — regulatory documents that reference specific rules, entities, and cross-document relationships extensively. Pure vector search struggled with queries like “what does our GDPR policy say about data retention, and does it align with the EU AI Act requirements?” — a query that requires traversing a relationship graph, not just finding similar text.
GraphRAG allowed us to:
- Link entities across documents: a named regulation, an internal policy, and a legal agreement could be nodes with explicit GOVERNS, AMENDS, and SUPERSEDES relationships
- Answer relationship queries directly: some queries are fundamentally about connections, not content similarity
- Improve retrieval recall for named entities: exact entity lookup via the graph complemented embedding-based semantic search
The graph was built in parallel with the vector index during ingestion. Entity extraction used a fine-tuned NER model (spaCy + custom components) rather than prompting the LLM for extraction — significantly cheaper at scale and more consistent.
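The relationship traversal described above can be illustrated with a small in-memory graph. This is only a sketch: the production system used Neo4j, and the `EntityGraph` class, the `related` method, and the entity names below are hypothetical stand-ins.

```python
from collections import defaultdict, deque


class EntityGraph:
    """Toy directed graph of entities joined by typed relationships."""

    def __init__(self):
        # source entity -> list of (relation, target entity)
        self._edges = defaultdict(list)

    def add_relation(self, source, relation, target):
        self._edges[source].append((relation, target))

    def related(self, start, max_hops=2):
        """Breadth-first walk returning (source, relation, target) triples
        reachable from `start` by following at most `max_hops` outgoing edges."""
        seen = {start}
        queue = deque([(start, 0)])
        triples = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for relation, target in self._edges.get(node, []):
                triples.append((node, relation, target))
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
        return triples


graph = EntityGraph()
graph.add_relation("GDPR", "GOVERNS", "Data Retention Policy v3")
graph.add_relation("Data Retention Policy v3", "SUPERSEDES", "Data Retention Policy v2")
graph.add_relation("Vendor Agreement 2021", "AMENDS", "Vendor Agreement 2019")
```

Calling `graph.related("GDPR")` follows the GOVERNS edge and then the SUPERSEDES edge, which is the kind of multi-document connection that similarity search alone does not surface. The sketch follows outgoing edges only; a real query would likely traverse in both directions.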
Chunking strategy
We tried three approaches before settling on a hierarchical semantic chunker:
- Fixed-size sliding window (baseline): fast to implement, poor quality on structured documents. Failed on tables, definitions, and numbered lists.
- Section-level chunking: better structure preservation, but sections varied from 200 to 8,000 tokens. Embedding quality degrades at extremes.
- Hierarchical semantic chunking (final): split documents into sections → sub-sections → paragraphs and store the parent-child relationships. Retrieval returns leaf chunks; the generation context also includes each leaf's parent to supply surrounding context.
This was the single highest-leverage quality improvement during the project.
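A simplified sketch of the parent-child idea follows. It uses only two levels (section and paragraph) instead of three, and the `Chunk`, `build_hierarchy`, and `generation_context` names and the ID scheme are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a top-level (section) chunk


def build_hierarchy(doc_id, sections):
    """Turn {section_title: [paragraph, ...]} into a dict of chunk_id -> Chunk,
    with one parent chunk per section and one leaf chunk per paragraph."""
    chunks = {}
    for s_idx, (title, paragraphs) in enumerate(sections.items()):
        section_id = f"{doc_id}/s{s_idx}"
        section_text = title + "\n" + "\n".join(paragraphs)
        chunks[section_id] = Chunk(section_id, section_text)
        for p_idx, paragraph in enumerate(paragraphs):
            leaf_id = f"{section_id}/p{p_idx}"
            chunks[leaf_id] = Chunk(leaf_id, paragraph, parent_id=section_id)
    return chunks


def generation_context(leaf_ids, chunks):
    """Expand retrieved leaf chunks to their parent text, de-duplicated,
    keeping the order in which parents were first hit."""
    seen = set()
    context = []
    for leaf_id in leaf_ids:
        parent_id = chunks[leaf_id].parent_id or leaf_id
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(chunks[parent_id].text)
    return context
```

Small leaves keep the embeddings focused, while expanding to the parent at generation time restores the definitions and list context that a lone paragraph would lose.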
Latency engineering
The original pipeline ran at ~14 seconds p95. Target was 6 seconds. Changes that moved the number:
| Change | Latency saved |
|---|---|
| Async parallel retrieval (vector + graph) | −3.2s |
| Reduced re-ranker batch size (20 → 10 candidates) | −1.8s |
| Response streaming for generation | −4.0s perceived |
| Query embedding cache (top 500 query patterns) | −0.8s |
| Result | < 5s p95 actual, ~2s perceived |
Streaming made the largest difference to user experience even though it didn’t reduce actual generation time. Analysts reported the system feeling “fast” once they saw tokens appearing within 1–2 seconds.
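Two of the changes in the table, parallel retrieval and the query embedding cache, can be sketched with the standard library alone. The search functions below are stubs that sleep instead of calling Qdrant or Neo4j, the embedding is a toy placeholder, and using an LRU of size 500 to approximate "top 500 query patterns" is an assumption.

```python
import asyncio
from functools import lru_cache


@lru_cache(maxsize=500)  # assumption: LRU eviction stands in for "top 500 patterns"
def embed_query(normalized_query: str) -> tuple:
    # Placeholder for an embedding API call; returns a toy 2-dim vector.
    return (float(len(normalized_query)), float(normalized_query.count(" ")))


async def vector_search(embedding: tuple) -> list:
    await asyncio.sleep(0.05)  # stand-in for a vector-store ANN query
    return ["chunk-12", "chunk-7"]


async def graph_search(query: str) -> list:
    await asyncio.sleep(0.05)  # stand-in for a graph traversal
    return ["chunk-7", "chunk-31"]


async def retrieve(query: str) -> list:
    # Normalize case and whitespace so near-identical queries share a cache entry.
    embedding = embed_query(" ".join(query.lower().split()))
    # Both lookups are awaited together, so wall time is roughly the slower
    # of the two rather than their sum.
    vector_hits, graph_hits = await asyncio.gather(
        vector_search(embedding), graph_search(query)
    )
    # Merge and de-duplicate while preserving first-seen order.
    return list(dict.fromkeys(vector_hits + graph_hits))
```

Running `asyncio.run(retrieve("data retention policy"))` returns the merged candidate list that would then be passed to the re-ranker.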
What we measured
Before launch, we built a 200-question golden dataset drawn from historical analyst queries with known answers. We tracked:
- Retrieval recall@5: 91% (does the correct source document appear in top 5 retrieved chunks?)
- Answer faithfulness: 94% (LLM-as-judge via GPT-4o comparing answer against retrieved context)
- Answer relevance: 88% (LLM-as-judge: does the answer address the question?)
- Latency p95: 4.8s
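As an illustration of how a retrieval recall@5 figure like the one above can be computed from a golden dataset, here is a minimal sketch. The `recall_at_k` function and the shape of its inputs are assumptions, not the project's evaluation harness.

```python
def recall_at_k(golden, retrieved, k=5):
    """Fraction of questions whose expected source document appears among
    the documents of the top-k retrieved chunks.

    golden:    {question_id: expected_doc_id}
    retrieved: {question_id: [doc_id of each retrieved chunk, best first]}
    """
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(
        1
        for question_id, expected_doc in golden.items()
        if expected_doc in retrieved.get(question_id, [])[:k]
    )
    return hits / len(golden)
```

A question with no retrieval results counts as a miss, so the metric cannot be inflated by skipping hard queries.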
Post-launch, analysts rated answer quality as “good” or “excellent” 82% of the time in the first 30 days, and query volume grew 15% week-over-week, a sign of adoption.
What we’d do differently
Entity extraction reliability: The NER pipeline worked well for common entity types but struggled with novel regulatory terminology. We’d invest more in the extraction evaluation framework earlier — we caught several systematic failures only because a senior analyst recognized the pattern.
Graph schema evolution: We underestimated how much the graph schema would need to evolve during development. More upfront investment in schema design and migration tooling would have saved time.
Semantic cache: We added the query cache post-launch. It should have been in the initial design: it reduced LLM costs by 23% with minimal complexity.
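For readers unfamiliar with the idea, a semantic cache returns a stored answer when a new query's embedding is close enough to one already answered. The sketch below is a generic illustration, not the project's implementation: the `SemanticCache` class, the linear scan, and the 0.95 cosine threshold are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold  # hypothetical similarity cut-off
        self._entries = []  # list of (embedding, answer)

    def put(self, embedding, answer):
        self._entries.append((embedding, answer))

    def get(self, embedding):
        """Return the answer of the most similar cached query, or None when
        nothing reaches the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

A hit skips the generation call entirely, which is where cost savings of this kind come from. At larger scale the linear scan would be replaced by a lookup in the vector store.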
Technologies used
Python 3.12 · LangChain · Azure OpenAI (GPT-4o, text-embedding-3-large) · Qdrant · Neo4j · FastAPI · Prefect · spaCy · Azure Document Intelligence · Azure Monitor · Docker · Bicep (IaC)