Enterprise RAG Pipeline for Document Intelligence

Tags: RAG · production-ml · NLP · architecture · financial-services
Designing and shipping a production RAG system that processes millions of regulatory and legal documents for a financial services client — architecture decisions, retrieval quality challenges, and what we learned.
Published: January 1, 2025

The problem

A financial services firm had 15 years of regulatory filings, internal policy documents, and legal agreements — roughly 4 million documents — stored in a document management system that was effectively unsearchable. Analysts spent 20–40% of their time locating precedents and cross-referencing policies.

The ask: build a system that lets analysts ask questions in natural language and get accurate, cited answers across the full corpus.


Constraints that shaped the architecture

Before writing a single line of code, the constraints told most of the story:

  • Data residency: all processing had to stay within Azure US regions (regulatory requirement)
  • No client data used for model training: this required Azure OpenAI rather than the public OpenAI API
  • Latency: analysts expected search-like response times, not research-assistant times. Firm target: p95 < 6 seconds
  • Citation required: every answer had to cite source documents with enough metadata for an analyst to verify independently
  • Audit trail: query + retrieved context + generation had to be logged for compliance review

These constraints eliminated several otherwise attractive design choices (e.g., using an external hosted vector database, using Claude via Anthropic’s API directly, streaming everything through a serverless function).
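
To make the audit-trail constraint concrete, here is a minimal sketch of what one logged record per query could look like. It is an illustration only: the field names, the JSON-lines format, and the content hash are assumptions, not the schema used on the project.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RetrievedChunk:
    document_id: str  # hypothetical ID from the document management system
    chunk_id: str
    score: float


@dataclass
class AuditRecord:
    query: str
    retrieved: list[RetrievedChunk]
    answer: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json_line(self) -> str:
        """Serialize to one JSON line, with a hash over the record contents."""
        payload = asdict(self)  # recurses into the nested RetrievedChunk items
        body = json.dumps(payload, sort_keys=True)
        # The hash lets a reviewer detect later modification of the record.
        payload["sha256"] = hashlib.sha256(body.encode("utf-8")).hexdigest()
        return json.dumps(payload, sort_keys=True)


# Made-up example values, for illustration only.
record = AuditRecord(
    query="What is the retention period for client records?",
    retrieved=[RetrievedChunk("policy-001", "policy-001#3.2", 0.87)],
    answer="Placeholder answer text [policy-001, section 3.2].",
)
line = record.to_json_line()
```

Keeping query, retrieved context, and generation in a single record means a compliance reviewer can reconstruct what the model saw without joining several logs.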


Architecture overview

                    ┌─────────────┐
    Documents ──────▶  Ingestion  ├──────────┐
                    │  Pipeline   │          │
                    └─────────────┘          ▼
                                      ┌──────────────┐
                                      │   Neo4j      │  ◀── Entity graph
                                      │  (GraphRAG)  │       relationships
                                      └──────┬───────┘
                                             │
    User Query ──▶  Query Layer  ──────────▶ │
                    (FastAPI)          ┌──────┴───────┐
                         │            │   Qdrant     │  ◀── Dense vectors
                         │            │  (Vectors)   │
                         │            └──────────────┘
                         │
                         ▼
                   ┌──────────────┐
                   │  Re-ranker   │  (cross-encoder)
                   └──────┬───────┘
                          │
                          ▼
                   ┌──────────────┐
                   │ Azure OpenAI │  (GPT-4o)
                   │  Generation  │
                   └──────────────┘

Key components:

Component               | Technology                          | Role
------------------------+-------------------------------------+---------------------------------------
Ingestion orchestration | Prefect                             | Scheduled + event-driven pipeline runs
Document parsing        | Azure Document Intelligence         | Layout-aware extraction from PDFs
Chunking                | Custom semantic chunker             | Respects document structure
Embeddings              | Azure OpenAI text-embedding-3-large | 3072-dim dense vectors
Vector store            | Qdrant (Azure VM)                   | ANN search with metadata filtering
Graph store             | Neo4j (Azure VM)                    | Entity and relationship traversal
Re-ranker               | BGE-Reranker-Large                  | Cross-encoder precision layer
Generation              | Azure OpenAI GPT-4o                 | Grounded generation with citations
API layer               | FastAPI + async                     | Query orchestration
Observability           | Azure Monitor + custom logging      | Compliance audit trail

Retrieval design: why we used GraphRAG

The corpus had high entity density — regulatory documents that reference specific rules, entities, and cross-document relationships extensively. Pure vector search struggled with queries like “what does our GDPR policy say about data retention, and does it align with the EU AI Act requirements?” — a query that requires traversing a relationship graph, not just finding similar text.

GraphRAG allowed us to:

  1. Link entities across documents: a named regulation, an internal policy, and a legal agreement could be nodes with explicit GOVERNS, AMENDS, and SUPERSEDES relationships
  2. Answer relationship queries directly: some queries are fundamentally about connections, not content similarity
  3. Improve retrieval recall for named entities: exact entity lookup via the graph complemented embedding-based semantic search

The graph was built in parallel with the vector index during ingestion. Entity extraction used a fine-tuned NER model (spaCy + custom components) rather than prompting the LLM for extraction — significantly cheaper at scale and more consistent.
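
The idea of complementing vector hits with graph traversal can be sketched in plain Python. This is a toy in-memory stand-in, not the Neo4j implementation: the edge list, node names, and the helper functions `neighbours`, `expand`, and `merge_candidates` are all hypothetical, and in a real deployment the expansion step would more likely be a Cypher query.

```python
from collections import deque

# Toy edge list: (source, relationship, target). All node names are made up.
EDGES = [
    ("regulation:GDPR", "GOVERNS", "policy:data-retention"),
    ("policy:data-retention-v2", "SUPERSEDES", "policy:data-retention"),
    ("agreement:vendor-dpa", "AMENDS", "policy:data-retention"),
    ("regulation:EU-AI-Act", "GOVERNS", "policy:model-risk"),
]


def neighbours(node: str) -> list[tuple[str, str]]:
    """Return (relationship, other_node) pairs, following edges in both directions."""
    out = []
    for src, rel, dst in EDGES:
        if src == node:
            out.append((rel, dst))
        elif dst == node:
            out.append((rel, src))
    return out


def expand(seeds: list[str], max_hops: int = 1) -> list[str]:
    """Breadth-first expansion from seed entities, up to max_hops edges away."""
    seen = set(seeds)
    order = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for _rel, other in neighbours(node):
            if other not in seen:
                seen.add(other)
                order.append(other)
                queue.append((other, depth + 1))
    return order


def merge_candidates(vector_hits: list[str], graph_hits: list[str]) -> list[str]:
    """Union of both retrievers: vector hits first, duplicates removed."""
    return list(dict.fromkeys(vector_hits + graph_hits))


# Seed entities would come from NER on the query; here one is hard-coded.
graph_hits = expand(["regulation:GDPR"], max_hops=2)
candidates = merge_candidates(
    ["policy:data-retention", "policy:privacy-notice"], graph_hits
)
```

Here the graph surfaces the superseding policy and the amending agreement, neither of which needs to be textually similar to the query. The merged candidate list would then go to the re-ranker.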


Chunking strategy

We tried three approaches, settling on the third, a hierarchical semantic chunker:

  1. Fixed-size sliding window (baseline): fast to implement, poor quality on structured documents. Failed on tables, definitions, and numbered lists.
  2. Section-level chunking: better structure preservation, but sections varied from 200 to 8,000 tokens, and embedding quality degraded at both extremes.
  3. Hierarchical semantic chunking (final): split documents into sections → sub-sections → paragraphs and store the parent-child relationships. Retrieval returns leaf chunks; the generation prompt also includes each leaf's parent chunk for surrounding context.

This was the single highest-leverage quality improvement during the project.
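
The parent-child idea behind the third approach can be sketched as follows. This is a simplified illustration under assumptions: the input is an already-parsed outline (section → sub-section → paragraphs), and the `Chunk` fields, ID scheme, and function names are hypothetical rather than taken from the production chunker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    level: str                 # "section", "subsection", or "paragraph"
    parent_id: Optional[str]   # None for top-level sections


def build_hierarchy(
    doc_id: str, outline: dict[str, dict[str, list[str]]]
) -> dict[str, Chunk]:
    """Flatten a nested {section: {subsection: [paragraphs]}} outline into chunks."""
    chunks: dict[str, Chunk] = {}
    for s_title, subsections in outline.items():
        s_id = f"{doc_id}/{s_title}"
        # Simplification: a section chunk stores only its title.
        chunks[s_id] = Chunk(s_id, s_title, "section", None)
        for ss_title, paragraphs in subsections.items():
            ss_id = f"{s_id}/{ss_title}"
            # A sub-section's text is the concatenation of its paragraphs.
            chunks[ss_id] = Chunk(ss_id, "\n".join(paragraphs), "subsection", s_id)
            for i, para in enumerate(paragraphs):
                p_id = f"{ss_id}/p{i}"
                chunks[p_id] = Chunk(p_id, para, "paragraph", ss_id)
    return chunks


def context_for(leaf_id: str, chunks: dict[str, Chunk]) -> str:
    """Return the parent's text for a retrieved leaf, or the leaf's own text."""
    leaf = chunks[leaf_id]
    parent = chunks.get(leaf.parent_id) if leaf.parent_id else None
    return parent.text if parent else leaf.text


outline = {"Section 1": {"1.1": ["First paragraph.", "Second paragraph."]}}
chunks = build_hierarchy("doc-1", outline)
leaf_ids = [c.chunk_id for c in chunks.values() if c.level == "paragraph"]
context = context_for(leaf_ids[0], chunks)
```

Only the paragraph-level chunks would be embedded and searched; the parent lookup happens after retrieval, so the index stays fine-grained while the generation prompt still sees the surrounding sub-section.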


Latency engineering

The original pipeline ran at ~14 seconds p95. Target was 6 seconds. Changes that moved the number:

Change                                              | Latency saved
----------------------------------------------------+----------------------------
Async parallel retrieval (vector + graph)           | −3.2s
Reduced re-ranker candidate set (20 → 10)           | −1.8s
Response streaming for generation                   | −4.0s perceived
Query embedding cache (top 500 query patterns)      | −0.8s
Result                                              | < 5s p95 actual, ~2s perceived
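
The parallel-retrieval change can be illustrated with `asyncio.gather`. In this sketch the two search functions are stand-ins that only sleep; their names, delays, and return values are assumptions, and a real version would call the Qdrant and Neo4j clients instead.

```python
import asyncio
import time


async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)  # stand-in for a vector store round trip
    return ["chunk-a", "chunk-b"]


async def graph_search(query: str) -> list[str]:
    await asyncio.sleep(0.3)  # stand-in for a graph traversal
    return ["chunk-b", "chunk-c"]


async def retrieve(query: str) -> list[str]:
    # Both coroutines run concurrently, so the wait is roughly the slower
    # of the two rather than their sum.
    vector_hits, graph_hits = await asyncio.gather(
        vector_search(query), graph_search(query)
    )
    # Merge, keeping first-seen order and dropping duplicates.
    return list(dict.fromkeys(vector_hits + graph_hits))


start = time.perf_counter()
results = asyncio.run(retrieve("data retention policy"))
elapsed = time.perf_counter() - start
```

With these stand-in delays the sequential version would take about 0.5 seconds and the concurrent one about 0.3 seconds; the saving in practice depends on how close the two retrievers are in latency.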

Streaming made the largest difference to user experience even though it didn’t reduce actual generation time. Analysts reported the system feeling “fast” once they saw tokens appearing within 1–2 seconds.
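
The gap between actual and perceived latency comes down to time-to-first-token versus total generation time. The sketch below measures both against a fake token stream; `generate_tokens` is a hypothetical stand-in for a streaming LLM response, and the delay is arbitrary.

```python
import asyncio
import time
from collections.abc import AsyncIterator
from typing import Optional


async def generate_tokens(answer: str, delay: float = 0.05) -> AsyncIterator[str]:
    """Stand-in for a streaming LLM response: one token every `delay` seconds."""
    for token in answer.split():
        await asyncio.sleep(delay)
        yield token


async def measure(answer: str) -> tuple[Optional[float], float, str]:
    """Consume the stream and record time-to-first-token and total time."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    tokens: list[str] = []
    async for token in generate_tokens(answer):
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        tokens.append(token)  # a real API layer would forward each token here
    total = time.perf_counter() - start
    return first_token_at, total, " ".join(tokens)


first, total, text = asyncio.run(measure("the answer arrives one token at a time"))
```

Total time is unchanged by streaming; what changes is that the user sees output after `first` seconds instead of after `total` seconds.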


What we measured

Before launch, we built a 200-question golden dataset drawn from historical analyst queries with known answers. We tracked:

  • Retrieval recall@5: 91% (does the correct source document appear in top 5 retrieved chunks?)
  • Answer faithfulness: 94% (LLM-as-judge via GPT-4o comparing answer against retrieved context)
  • Answer relevance: 88% (LLM-as-judge: does the answer address the question?)
  • Latency p95: 4.8s
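
For reference, recall@k as defined in the first bullet is a per-question hit rate. A minimal sketch, assuming each golden question has exactly one correct source document and that retrieval results are already mapped to document IDs (the function name and the data are illustrative):

```python
def recall_at_k(
    retrieved_doc_ids: list[list[str]], gold_doc_ids: list[str], k: int = 5
) -> float:
    """Fraction of questions whose gold document appears in the top-k results."""
    if len(retrieved_doc_ids) != len(gold_doc_ids):
        raise ValueError("one retrieved list is required per gold document")
    hits = sum(
        1
        for retrieved, gold in zip(retrieved_doc_ids, gold_doc_ids)
        if gold in retrieved[:k]
    )
    return hits / len(gold_doc_ids)


# Made-up results for four questions.
retrieved = [["d1", "d2"], ["d9", "d3"], ["d4"], ["d7", "d8"]]
gold = ["d1", "d3", "d5", "d8"]

score_at_5 = recall_at_k(retrieved, gold, k=5)
score_at_1 = recall_at_k(retrieved, gold, k=1)
```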

Post-launch, analysts rated answer quality as “good” or “excellent” 82% of the time in the first 30 days. Query volume grew 15% week-over-week, which we took as a sign of adoption.


What we’d do differently

Entity extraction reliability: The NER pipeline worked well for common entity types but struggled with novel regulatory terminology. We’d invest in the extraction evaluation framework earlier: we caught several systematic failures only because a senior analyst recognized the pattern.

Graph schema evolution: We underestimated how much the graph schema would need to evolve during development. More upfront investment in schema design and migration tooling would have saved time.

Semantic cache: We added the query cache post-launch. It should have been in the initial design: it reduced LLM costs by 23% with minimal added complexity.
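
One common way to build such a cache is to key it on query embeddings and return a stored answer when a new query is close enough to a cached one. The sketch below shows that idea only; the class, the cosine threshold of 0.95, and the linear scan are assumptions, not a description of the cache that was deployed.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def get(self, embedding: list[float]) -> Optional[str]:
        """Return the answer of the most similar cached query, if similar enough."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, embedding: list[float], answer: str) -> None:
        self._entries.append((embedding, answer))


# Toy 2-dimensional embeddings; real ones would come from the embedding model.
cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05])   # nearly the same direction as the cached query
miss = cache.get([0.0, 1.0])    # orthogonal to the cached query
```

A linear scan is fine for a few hundred entries; a larger cache would more naturally live in the existing vector store as its own collection.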


Technologies used

Python 3.12 · LangChain · Azure OpenAI (GPT-4o, text-embedding-3-large) · Qdrant · Neo4j · FastAPI · Prefect · spaCy · Azure Document Intelligence · Azure Monitor · Docker · Bicep (IaC)
