
Building an Automated Fact-Checking Pipeline with LangGraph

Published at 10:00 AM

What I Built

GroundCrew is an automated fact-checking pipeline. Give it any text, and it:

  1. Extracts factual claims
  2. Searches for evidence (web or Wikipedia)
  3. Verifies each claim (SUPPORTS / REFUTES / NOT ENOUGH INFO)
  4. Generates a report with confidence scores and evidence

Built with LangGraph, fully typed with Pydantic, runs in batch mode for evaluations.
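The four stages above can be sketched as a plain-Python pipeline. This is an illustrative skeleton, not GroundCrew's actual API: the function and field names are assumptions, and the LLM calls are stubbed out.

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    evidence: list = field(default_factory=list)
    label: str = "NOT ENOUGH INFO"   # default verdict: abstain
    confidence: float = 0.0

def extract_claims(text: str) -> list:
    # Stage 1: in the real pipeline this is an LLM call; a naive
    # sentence split stands in for claim extraction here.
    return [Claim(s.strip()) for s in text.split(".") if s.strip()]

def retrieve_evidence(claim: Claim) -> Claim:
    # Stage 2: web or Wikipedia search would populate this list.
    claim.evidence = [f"search results for: {claim.text}"]
    return claim

def verify(claim: Claim) -> Claim:
    # Stage 3: an LLM judge assigns SUPPORTS / REFUTES / NOT ENOUGH INFO;
    # stubbed to an abstention with middling confidence.
    claim.label, claim.confidence = "NOT ENOUGH INFO", 0.5
    return claim

def report(claims: list) -> str:
    # Stage 4: render each verdict with its confidence score.
    return "\n".join(f"[{c.label} @ {c.confidence:.2f}] {c.text}" for c in claims)

def pipeline(text: str) -> str:
    return report([verify(retrieve_evidence(c)) for c in extract_claims(text)])
```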

Quick Start:

```shell
git clone https://github.com/tsensei/GroundCrew.git
cd GroundCrew
# Setup instructions in README
```

Architecture

4-Stage LangGraph Pipeline: Claim Extraction → Evidence Retrieval → Verification → Report Generation

All nodes use structured outputs (Pydantic schemas), so no “stringly-typed” prompt glue. Type safety throughout the graph.
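A minimal sketch of what a verifier-node schema might look like. The field names here are illustrative assumptions, not GroundCrew's actual models:

```python
from typing import Literal
from pydantic import BaseModel, Field

class Verdict(BaseModel):
    """Structured output the verifier node must return."""
    claim: str
    label: Literal["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
    confidence: float = Field(ge=0.0, le=1.0)  # validated, not free-text
    evidence_ids: list[int] = []

# With a LangChain-style chat model, the schema can be bound directly,
# e.g. llm.with_structured_output(Verdict), so downstream nodes always
# receive a validated Verdict object rather than raw prompt output.
```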

Results (FEVER Dataset, 100 Samples)

Tested on a FEVER subset with GPT-4o:

| Configuration   | Overall | SUPPORTS | REFUTES | NEI |
|-----------------|---------|----------|---------|-----|
| Web Search      | 71%     | 88%      | 82%     | 42% |
| Wikipedia-only  | 72%     | 91%      | 88%     | 36% |
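For reference, the per-label and overall numbers in a table like this can be computed from (gold, predicted) pairs:

```python
from collections import defaultdict

def accuracy_by_label(pairs):
    """pairs: iterable of (gold_label, predicted_label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        totals[gold] += 1
        hits[gold] += gold == pred
    per_label = {g: hits[g] / totals[g] for g in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_label

# Toy example, not the FEVER run itself:
pairs = [("SUPPORTS", "SUPPORTS"), ("REFUTES", "SUPPORTS"),
         ("NOT ENOUGH INFO", "SUPPORTS"), ("NOT ENOUGH INFO", "NOT ENOUGH INFO")]
overall, per_label = accuracy_by_label(pairs)
# overall -> 0.5; per_label["NOT ENOUGH INFO"] -> 0.5
```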

Context: Specialized FEVER systems hit 85-90%+. For a weekend LLM pipeline, 72% is decent, but NOT ENOUGH INFO is clearly broken.

Where It Breaks

NOT ENOUGH INFO (42% accuracy): The model infers a verdict from partial evidence instead of abstaining. Teaching LLMs to say “I don’t know” is harder than getting them to support or refute.
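One cheap mitigation, sketched here as an assumption rather than anything GroundCrew currently does, is post-hoc confidence thresholding: downgrade low-confidence SUPPORTS/REFUTES verdicts to NOT ENOUGH INFO.

```python
def apply_abstention(label: str, confidence: float, threshold: float = 0.75) -> str:
    """Force an abstention when the verifier is not confident enough.

    Trades away a few correct SUPPORTS/REFUTES calls for better NEI
    recall; the threshold would need tuning on a held-out FEVER split.
    """
    if label in ("SUPPORTS", "REFUTES") and confidence < threshold:
        return "NOT ENOUGH INFO"
    return label

apply_abstention("SUPPORTS", 0.6)   # downgraded to "NOT ENOUGH INFO"
apply_abstention("REFUTES", 0.9)    # kept as "REFUTES"
```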

Evidence Specificity: A claim says “founded by two men”; the evidence lists two names but never explicitly states “two.” The verifier counts the names and declares SUPPORTS, which is technically wrong under FEVER annotation guidelines.
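A lightweight guard for this failure mode (a sketch, not something in the repo) is to require that numeric quantifiers in the claim appear explicitly in the evidence before accepting SUPPORTS:

```python
import re

# Small illustrative mapping; a real version would cover more number words.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def numbers_in(text: str) -> set:
    """Collect digits and spelled-out numbers, normalized to digit strings."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return {NUMBER_WORDS.get(t, t) for t in tokens if t.isdigit() or t in NUMBER_WORDS}

def quantifiers_grounded(claim: str, evidence: str) -> bool:
    """True only if every number in the claim is stated in the evidence."""
    return numbers_in(claim) <= numbers_in(evidence)

quantifiers_grounded("founded by two men",
                     "founded by Alice Smith and Bob Jones")  # False: "two" never stated
quantifiers_grounded("founded by two men",
                     "founded by a team of two")              # True
```

When the check fails, the verifier could be forced to re-examine the claim or fall back to NOT ENOUGH INFO instead of counting entities itself.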

Contradiction Edges: Temporal qualifiers (“as of 2019…”) and entity disambiguation (the same name referring to different entities) still cause misclassifications.

Tech Stack

What’s Included

Open Questions

I’m looking for feedback on:

NEI Detection: What works for making abstention stick? Routing strategies? NLI filters? Confidence thresholding?

Contradiction Handling: Lightweight approaches to catch “close but not entailed” evidence without heavy reranker stacks?

Eval Design: What would make you trust this style of system? Harder subsets? Human-in-the-loop checks?

MIT licensed. If you’re working on fact-checking or have ideas for NEI detection, I’d love to hear them.

