Enterprise RAG Pipeline for Document Intelligence

RAG

production-ml

NLP

architecture

financial-services

Designing and shipping a production RAG system that processes millions of regulatory and legal documents for a financial services client — architecture decisions, retrieval quality challenges, and what we learned.

Published

January 1, 2025

The problem

A financial services firm had 15 years of regulatory filings, internal policy documents, and legal agreements — roughly 4 million documents — stored in a document management system that was effectively unsearchable. Analysts spent 20–40% of their time locating precedents and cross-referencing policies.

The ask: build a system that lets analysts ask questions in natural language and get accurate, cited answers across the full corpus.

Constraints that shaped the architecture

Before writing a single line of code, the constraints told most of the story:

Data residency: all processing had to stay within Azure US regions (regulatory requirement)
No data used for OpenAI training: this required Azure OpenAI rather than the public OpenAI API
Latency: analysts expected search-like response times, not research-assistant times. Firm target: p95 < 6 seconds
Citation required: every answer had to cite source documents with enough metadata for an analyst to verify independently
Audit trail: query + retrieved context + generation had to be logged for compliance review

These constraints eliminated several otherwise attractive design choices (e.g., using an external hosted vector database, using Claude via Anthropic’s API directly, streaming everything through a serverless function).
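
To make the citation and audit-trail constraints concrete, here is a minimal sketch of what one logged interaction might look like. The `Citation` and `AuditRecord` classes, their field names, and the sample values are all hypothetical illustrations, not the schema used on the project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Citation:
    # Hypothetical metadata an analyst would need to verify a claim independently.
    document_id: str
    title: str
    section: str
    page: int


@dataclass(frozen=True)
class AuditRecord:
    query: str
    retrieved_chunk_ids: list[str]
    answer: str
    citations: list[Citation]
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize to one JSON line and attach a SHA-256 digest of the content."""
        payload = asdict(self)
        body = json.dumps(payload, sort_keys=True)
        payload["sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return json.dumps(payload, sort_keys=True)


# Sample values are invented for illustration.
record = AuditRecord(
    query="What is the retention period for client records?",
    retrieved_chunk_ids=["doc-17#p3", "doc-42#p1"],
    answer="See section 4.2 of the Records Retention Policy [1].",
    citations=[Citation("doc-17", "Records Retention Policy", "4.2", 12)],
)
line = record.to_json()
```

Writing one self-describing line per query keeps the query, the retrieved context, and the generated answer together, which is what a compliance reviewer needs to reconstruct an interaction.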

Architecture overview

                    ┌─────────────┐
    Documents ──────▶  Ingestion  ├──────────┐
                    │  Pipeline   │          │
                    └─────────────┘          ▼
                                      ┌──────────────┐
                                      │   Neo4j      │  ◀── Entity graph
                                      │  (GraphRAG)  │       relationships
                                      └──────┬───────┘
                                             │
    User Query ──▶  Query Layer  ──────────▶ │
                    (FastAPI)          ┌──────┴───────┐
                         │            │   Qdrant     │  ◀── Dense vectors
                         │            │  (Vectors)   │
                         │            └──────────────┘
                         │
                         ▼
                   ┌──────────────┐
                   │  Re-ranker   │  (cross-encoder)
                   └──────┬───────┘
                          │
                          ▼
                   ┌──────────────┐
                   │ Azure OpenAI │  (GPT-4o)
                   │  Generation  │
                   └──────────────┘

Key components:

Component	Technology	Role
Ingestion orchestration	Prefect	Scheduled + event-driven pipeline runs
Document parsing	Azure Document Intelligence	Layout-aware extraction from PDFs
Chunking	Custom semantic chunker	Respects document structure
Embeddings	Azure OpenAI `text-embedding-3-large`	3072-dim dense vectors
Vector store	Qdrant (Azure VM)	ANN search with metadata filtering
Graph store	Neo4j (Azure VM)	Entity and relationship traversal
Re-ranker	BGE-Reranker-Large	Cross-encoder precision layer
Generation	Azure OpenAI GPT-4o	Grounded generation with citations
API layer	FastAPI + async	Query orchestration
Observability	Azure Monitor + custom logging	Compliance audit trail
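
The retrieve-then-re-rank step in the diagram can be sketched as follows. This is an illustration under stated assumptions: `rerank` takes any scoring function, and `token_overlap` is a toy stand-in used only so the example runs without a model. It is not a cross-encoder and not how BGE-Reranker-Large scores passages.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Score every candidate against the query and keep the top_n highest."""
    scored = [(score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer: share of query tokens that appear in the passage."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


# Invented candidate passages standing in for first-stage retrieval results.
candidates = [
    "Vendor onboarding checklist",
    "Data retention schedule for client records",
    "Retention of marketing data",
]
top = rerank("client data retention", candidates, token_overlap, top_n=2)
```

In a real deployment the `score` argument would wrap the cross-encoder, which is far more expensive per pair than first-stage retrieval. That cost is why the number of candidates passed to this step matters for latency.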

Retrieval design: why we used GraphRAG

The corpus had high entity density — regulatory documents that reference specific rules, entities, and cross-document relationships extensively. Pure vector search struggled with queries like “what does our GDPR policy say about data retention, and does it align with the EU AI Act requirements?” — a query that requires traversing a relationship graph, not just finding similar text.

GraphRAG allowed us to:

Link entities across documents: a named regulation, an internal policy, and a legal agreement could be nodes with explicit GOVERNS, AMENDS, and SUPERSEDES relationships
Answer relationship queries directly: some queries are fundamentally about connections, not content similarity
Improve retrieval recall for named entities: exact entity lookup via the graph complemented embedding-based semantic search

The graph was built in parallel with the vector index during ingestion. Entity extraction used a fine-tuned NER model (spaCy + custom components) rather than prompting the LLM for extraction — significantly cheaper at scale and more consistent.
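
A small in-memory sketch shows why typed relationships help with connection-style queries. The entity names and edges below are invented, and a plain adjacency list stands in for Neo4j; in the real system this traversal would be a graph query rather than Python code.

```python
from collections import defaultdict, deque

# (source, relationship, target) triples; all names are invented for illustration.
EDGES = [
    ("GDPR", "GOVERNS", "Data Retention Policy v3"),
    ("Data Retention Policy v3", "SUPERSEDES", "Data Retention Policy v2"),
    ("Amendment 2021-04", "AMENDS", "Vendor Agreement A"),
    ("Data Retention Policy v3", "GOVERNS", "Vendor Agreement A"),
]


def build_adjacency(edges):
    """Index edges by source node for outgoing traversal."""
    adjacency = defaultdict(list)
    for source, relation, target in edges:
        adjacency[source].append((relation, target))
    return adjacency


def related_entities(adjacency, start, max_hops=2):
    """Breadth-first traversal returning (entity, path) pairs within max_hops."""
    results = []
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        node, path = queue.popleft()
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for relation, target in adjacency.get(node, []):
            if target in seen:
                continue
            seen.add(target)
            new_path = path + [(node, relation, target)]
            results.append((target, new_path))
            queue.append((target, new_path))
    return results


adjacency = build_adjacency(EDGES)
hits = related_entities(adjacency, "GDPR", max_hops=2)
```

Each hit carries the path that led to it, so the documents attached to those entities can be retrieved together with an explanation of how they relate to the entity named in the query.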

Chunking strategy

We tried three approaches before settling on a hierarchical semantic chunker:

Fixed-size sliding window (baseline): fast to implement, poor quality on structured documents. Failed on tables, definitions, and numbered lists.
Section-level chunking: better structure preservation, but sections varied from 200 to 8,000 tokens, and embedding quality degrades at both extremes.
Hierarchical semantic chunking (final): split documents into sections → sub-sections → paragraphs and store the parent-child relationships. Queries retrieve leaf chunks; the generation context also includes each leaf's parent so the model sees the surrounding text.

Moving to hierarchical chunking was the single highest-leverage quality improvement of the project.

Latency engineering

The original pipeline ran at ~14 seconds p95. Target was 6 seconds. Changes that moved the number:

Change	Latency saved
Async parallel retrieval (vector + graph)	−3.2s
Fewer re-ranker candidates (20 → 10)	−1.8s
Response streaming for generation	−4.0s perceived
Query embedding cache (top 500 query patterns)	−0.8s
Result	< 5s p95 actual, ~2s perceived

Streaming made the largest difference to user experience even though it didn’t reduce actual generation time. Analysts reported the system feeling “fast” once they saw tokens appearing within 1–2 seconds.
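
The parallel-retrieval change amounts to awaiting both stores concurrently instead of one after the other. In the sketch below, `vector_search` and `graph_search` are hypothetical stand-ins that sleep to simulate network latency; they are not Qdrant or Neo4j client calls.

```python
import asyncio
import time


async def vector_search(query: str) -> list[str]:
    """Stand-in for a vector store query; the sleep simulates network latency."""
    await asyncio.sleep(0.2)
    return [f"vector-hit-for:{query}"]


async def graph_search(query: str) -> list[str]:
    """Stand-in for a graph traversal query."""
    await asyncio.sleep(0.3)
    return [f"graph-hit-for:{query}"]


async def retrieve(query: str) -> list[str]:
    """Run both retrievers concurrently and merge results, de-duplicated in order."""
    vector_hits, graph_hits = await asyncio.gather(
        vector_search(query),
        graph_search(query),
    )
    return list(dict.fromkeys(vector_hits + graph_hits))


start = time.perf_counter()
results = asyncio.run(retrieve("data retention"))
elapsed = time.perf_counter() - start  # close to the slower call, not the sum
```

With `asyncio.gather`, total retrieval time tracks the slower of the two calls rather than their sum, which is where a saving of this kind comes from.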

What we measured

Before launch, we built a 200-question golden dataset drawn from historical analyst queries with known answers. We tracked:

Retrieval recall@5: 91% (does the correct source document appear in top 5 retrieved chunks?)
Answer faithfulness: 94% (LLM-as-judge via GPT-4o comparing answer against retrieved context)
Answer relevance: 88% (LLM-as-judge: does the answer address the question?)
Latency p95: 4.8s
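
For reference, recall@k over a golden dataset is a simple computation. The sketch below assumes each question has exactly one expected source document and that retrieved chunks have already been mapped to their document IDs; the question and document IDs are invented.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of questions whose expected source document appears in the top k."""
    if not relevant:
        return 0.0
    hits = 0
    for question_id, expected_doc in relevant.items():
        top_k = retrieved.get(question_id, [])[:k]
        if expected_doc in top_k:
            hits += 1
    return hits / len(relevant)


# Golden answers: question ID -> expected source document.
relevant = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-c", "q4": "doc-d"}

# Ranked retrieval output per question, as source document IDs.
retrieved = {
    "q1": ["doc-x", "doc-a", "doc-y"],
    "q2": ["doc-b"],
    "q3": ["d1", "d2", "d3", "d4", "d5", "doc-c"],  # ranked 6th: a miss at k=5
    "q4": ["doc-d", "doc-z"],
}
score = recall_at_k(retrieved, relevant, k=5)
```

Here three of the four questions have the expected document in the top five, so `score` is 0.75.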

Post-launch, analysts rated answer quality as “good” or “excellent” 82% of the time in the first 30 days. Query volume grew 15% week-over-week over the same period, a sign of adoption.

What we’d do differently

Entity extraction reliability: The NER pipeline worked well for common entity types but struggled with novel regulatory terminology. We’d invest more in the extraction evaluation framework earlier — we caught several systematic failures only because a senior analyst recognized the pattern.

Graph schema evolution: We underestimated how much the graph schema would need to evolve during development. More upfront investment in schema design and migration tooling would have saved time.

Semantic cache: We added the query cache post-launch. It should have been in the initial design: it reduced LLM costs by 23% with minimal complexity.
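
A semantic cache of this kind can be sketched as a lookup keyed on embedding similarity instead of exact query text. Everything below is an assumption made for illustration: the `SemanticCache` class, the 0.95 threshold, the 500-entry cap, oldest-first eviction, and the three-dimensional toy vectors standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache that returns a stored answer for sufficiently similar queries."""

    def __init__(self, threshold=0.95, max_entries=500):
        self.threshold = threshold      # assumed similarity cut-off
        self.max_entries = max_entries  # assumed capacity
        self._entries = []              # (embedding, answer) pairs, oldest first

    def get(self, embedding):
        """Return the best-matching cached answer, or None below the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding, answer):
        """Store an answer, evicting the oldest entry when the cache is full."""
        if len(self._entries) >= self.max_entries:
            self._entries.pop(0)
        self._entries.append((embedding, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer A")
near_hit = cache.get([0.99, 0.05, 0.0])  # nearly the same direction: served from cache
miss = cache.get([0.0, 1.0, 0.0])        # orthogonal query: falls through to the LLM
```

The threshold is the main design decision: set it too low and analysts get answers to a neighbouring question, set it too high and the cache rarely hits.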

Technologies used

Python 3.12 · LangChain · Azure OpenAI (GPT-4o, text-embedding-3-large) · Qdrant · Neo4j · FastAPI · Prefect · spaCy · Azure Document Intelligence · Azure Monitor · Docker · Bicep (IaC)