riskfolio-graphrag-agent review

Overall verdict

This repo is better than a toy demo, but it is not yet a credible senior-level GraphRAG package in the way the README implies. The architecture is organized, the module boundaries are thoughtful, and there is real effort around eval, tracing, and graph modeling. But the core retrieval and evaluation methods are still dominated by heuristics, tiny benchmarks, deterministic fallbacks, and metrics that are much easier to game than to trust. The main risk is not that the repo is incompetent; it is that it presents itself as more rigorous, more “production-ready,” and more validated than the actual evidence supports.

What is actually solid

  • Clear package decomposition. The layering across ingestion, graph, retrieval, agent, eval, and app is coherent and more disciplined than many portfolio repos.
  • Good architectural self-awareness. docs/architecture_module_map.md is unusually explicit about boundaries and non-goals.
  • Deterministic fallbacks are a sensible design choice for a demo repo. That makes local development and testing easier.
  • The graph builder has some legitimate engineering substance: idempotent upserts, ontology-style labels/relations, and bounded subgraph extraction.
  • The repo includes tests, CI surface area, and reporting artifacts instead of being just a notebook and README.
  • Safety framing around NL-to-Cypher and bounded graph expansion is directionally correct.

What looks weak / juvenile / risky

  • The README overclaims. Phrases like “production-ready for enterprise KG/RAG/agentic AI deployment” are not supported by the actual retrieval, evaluation, or robustness evidence.
  • “Dense retrieval” is often hash embeddings, not meaningful semantic embeddings. That is acceptable as a fallback, but not as evidence of strong dense retrieval quality.
  • The Neo4j “vector” fallback is not a vector backend at all; it is lexical sparse retrieval over chunk text. Naming it Neo4jChunkVectorStore is misleading.
  • Hybrid ranking is fixed-weight score mixing plus graph-count boosts. That is fine for a prototype, but weak as a serious retrieval method unless justified by proper ablations.
  • Query routing is mostly regex/rule routing plus prototype similarity over hash embeddings. That is a heuristic intent switch, not a robust router.
  • Evaluation is the biggest credibility problem. The sample set is tiny, hand-authored, and domain-narrow, and the metrics are mostly heuristic token-overlap proxies whose “RAGAS-style” naming makes them sound more rigorous than they are.
  • Perfect or near-perfect metrics in eval_results.json are not believable as strong evidence because the benchmark is so small and the metrics are structurally forgiving. answer_faithfulness=1.0, multi_hop_accuracy=1.0, ER metrics = 1.0, and link prediction metrics = 1.0 are exactly the sort of numbers that make experienced reviewers suspicious.
  • The ablation artifact undercuts the GraphRAG story: sparse wins, graph is worse, and hybrid_rerank is materially worse on the benchmark in benchmarks/retrieval_ablation_results.md. That does not mean the repo is bad, but it does mean the repo currently does not demonstrate that the graph machinery is earning its keep.
  • Several tests verify stubs, compatibility helpers, or type-level behavior rather than meaningful end-to-end correctness. That inflates the appearance of maturity.
  • The “agent” layer is fairly thin. The plan-retrieve-reason-verify flow is structurally present, but the reasoning and verification are simple heuristics unless an external LLM is injected.
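
To make the concern about token-overlap metrics concrete, here is a minimal sketch of how such a metric can award a perfect score to a wrong answer. The function `overlap_faithfulness`, the example contexts, and the answers are all hypothetical illustrations written for this review; they are not taken from the repo, whose actual metric may differ in detail.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; punctuation and hyphens act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_faithfulness(answer: str, contexts: list[str]) -> float:
    """Hypothetical metric: fraction of answer tokens found anywhere in the contexts."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(tokens(c) for c in contexts))
    return len(answer_tokens & context_tokens) / len(answer_tokens)


# Made-up contexts for illustration only.
contexts = [
    "CVaR measures the expected loss in the worst tail of the return distribution.",
    "Mean-variance optimization minimizes portfolio variance for a target return.",
]

correct = "CVaR measures the expected loss in the worst tail."
# A false claim stitched together from words that occur in two different contexts.
wrong = "CVaR minimizes portfolio variance for the worst return."

score_correct = overlap_faithfulness(correct, contexts)
score_wrong = overlap_faithfulness(wrong, contexts)
print(score_correct, score_wrong)  # both answers receive the maximum score
```

Because every word of the wrong answer appears somewhere in the retrieved text, the metric cannot distinguish it from the correct one. A metric of this shape rewards vocabulary reuse, not supported claims.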

Most important technical gaps

  1. Evaluation rigor is insufficient for the claims. Five samples is nowhere near enough, and the metrics are mostly overlap heuristics. This is the single biggest blocker to credibility.
  2. The graph contribution is not convincingly demonstrated. Your own ablation suggests sparse retrieval is best and graph/hybrid do not clearly improve outcomes.
  3. Method naming and framing are too generous. Calling heuristic overlap metrics “RAGAS-style,” calling a lexical fallback a vector store, and claiming production readiness all weaken trust.
  4. Dense retrieval evidence is weak. If the default or common path uses hash embeddings, then the repo is not demonstrating serious semantic retrieval quality.
  5. The agentic story is overstated. The workflow orchestration exists, but the reasoning, verification, and self-correction are still shallow and mostly procedural.
  6. Test suite signal is mixed. There are useful tests, but too many of them validate fallback behavior or structure rather than method quality and failure modes.
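
The sample-size point in item 1 can be quantified. The sketch below computes a Wilson score interval for a pass rate, assuming each sample is scored pass/fail and treated as an independent trial; that is a simplification for illustration, not how the repo reports its metrics. The function name `wilson_interval` is introduced here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# A perfect score on 5 samples versus a perfect score on 100 samples.
low_5, high_5 = wilson_interval(5, 5)
low_100, high_100 = wilson_interval(100, 100)
print(f"5/5:     [{low_5:.3f}, {high_5:.3f}]")
print(f"100/100: [{low_100:.3f}, {high_100:.3f}]")
```

Under these assumptions, 5 out of 5 is consistent with a true pass rate as low as about 0.57, while 100 out of 100 puts the lower bound at about 0.96. A reported 1.0 on five samples therefore says very little.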

Concrete recommendations

  • Rewrite the README to be more honest:
    • Call it a GraphRAG / hybrid retrieval demo and evaluation scaffold.
    • Remove “production-ready” language unless you add much stronger operational and evaluation evidence.
    • Explicitly state which components are heuristic and which are serious implementations.
  • Rename misleading abstractions:
    • Neo4jChunkVectorStore should be renamed to something like Neo4jLexicalStore or Neo4jSparseFallbackStore.
    • Avoid language that implies true semantic vector retrieval when the backend is lexical.
  • Upgrade the evaluation harness materially:
    • Build at least 50–100 evaluation questions spanning definition lookup, API lookup, constraint questions, estimator questions, comparison questions, and multi-hop graph questions.
    • Split them by difficulty and retrieval type.
    • Add adversarial / negative examples where graph expansion should not help.
    • Add query-set versioning and make the benchmark corpus explicit.
  • Separate heuristic metrics from benchmark-grade metrics:
    • Keep the current metrics if useful, but label them plainly as heuristic overlap metrics.
    • Add human-reviewed adjudication on a subset.
    • If using LLM-as-judge, isolate it as a separate optional metric with clear caveats.
  • Prove the graph helps or simplify it:
    • Run controlled ablations on a larger benchmark and report where graph retrieval improves recall/precision versus sparse/dense.
    • If graph only helps on a narrow slice of query types, say that explicitly.
    • If graph does not help enough, reduce the GraphRAG claims and position it as an exploratory architecture.
  • Improve retrieval method quality:
    • Use a real embedding model by default for serious runs.
    • Add reciprocal-rank fusion or another better-founded hybrid combiner before considering learned rerankers.
    • Normalize scores properly across dense/sparse channels rather than mixing raw scores with fixed constants.
  • Toughen verification:
    • Current answer verification is token-overlap grounding. That is too weak.
    • Verify answer claims against cited contexts at the sentence or claim level.
    • Penalize unsupported details and cross-context stitching errors.
  • Add real failure-mode analysis:
    • Show example misses, false positives, routing mistakes, and graph-noise cases.
    • Include a short “known limitations” section in the README.
  • Improve tests where it matters:
    • Add retrieval-regression tests with expected top-k ordering on a fixed mini corpus.
    • Add router tests on ambiguous queries.
    • Add graph-construction tests for deduplication, alias collisions, and edge integrity.
    • Add eval tests that ensure metrics are not trivially maxing out on degenerate contexts.
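
For the reciprocal-rank fusion suggestion under “Improve retrieval method quality,” a minimal sketch follows. It fuses ranked lists by rank position only, so raw dense and sparse scores on different scales never need to be mixed. The function name, the chunk ids, and the two example rankings are hypothetical; `k=60` is the constant commonly used for RRF and should be treated as a tunable assumption.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break ties by id for deterministic output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_ranking = ["c1", "c2", "c3"]
dense_ranking = ["c3", "c1", "c4"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)  # ['c1', 'c3', 'c2', 'c4']
```

Chunks that both retrievers rank highly rise to the top, and no per-channel weight has to be hand-tuned. A fixed ordering like this on a small corpus is also a natural fixture for the retrieval-regression tests recommended above.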

Credibility upgrade plan

Minimum viable upgrade

If you want the fastest path to making this repo look more senior-level without rebuilding everything:

  1. Tone down the README claims.
  2. Rename misleading components.
  3. Expand the eval set from 5 to at least 25 carefully designed samples.
  4. Publish honest ablations showing where sparse beats graph and where graph helps.
  5. Add a limitations section.

That alone would substantially improve reviewer trust.

Stronger senior-level version

If you want this repo to read as genuinely strong:

  1. Build a real benchmark with broader query coverage and versioned samples.
  2. Demonstrate graph value on clearly defined query classes rather than through a hand-wavy overall claim of “better context.”
  3. Replace heuristic metrics with more defensible evaluation, or clearly segregate the two.
  4. Use a real embedding model in benchmark runs.
  5. Add a retrieval error analysis document with concrete failures and design responses.
  6. Reduce complexity where it is decorative and strengthen the evidence where it matters.

Bottom line

The repo shows real engineering effort and decent architectural instincts. That part is not juvenile. The juvenile part is mainly the gap between sophistication of presentation and rigor of validation. Right now it reads like an ambitious prototype packaged as a mature system. If you narrow the claims, strengthen evaluation, and prove where the graph actually helps, it can become a much more credible portfolio artifact.
