What I Built
GroundCrew is an automated fact-checking pipeline. Give it any text, and it:
- Extracts factual claims
- Searches for evidence (web or Wikipedia)
- Verifies each claim (SUPPORTS / REFUTES / NOT ENOUGH INFO)
- Generates a report with confidence scores and evidence
Built with LangGraph, fully typed with Pydantic, runs in batch mode for evaluations.
Quick Start:
```bash
git clone https://github.com/tsensei/GroundCrew.git
cd GroundCrew
# Setup instructions in README
```
Architecture
4-Stage LangGraph Pipeline:
- Extraction: Pulls factual statements from input text
- Search: Tavily web search or Wikipedia mode
- Verification: Compares claim against evidence
- Reporting: Markdown/JSON output with rationale
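If you haven't used LangGraph, the wiring looks roughly like this. A minimal sketch with hypothetical state fields and stub nodes, not the repo's actual code:

```python
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class PipelineState(TypedDict):
    """Hypothetical shared state flowing through the four stages."""
    input_text: str
    claims: List[str]      # filled by extraction
    evidence: List[dict]   # filled by search
    verdicts: List[dict]   # filled by verification
    report: str            # filled by reporting


# Stub nodes; each returns a partial state update that LangGraph merges in.
def extract(state: PipelineState) -> dict:
    return {"claims": []}      # LLM call that pulls factual claims

def search(state: PipelineState) -> dict:
    return {"evidence": []}    # Tavily or Wikipedia lookup per claim

def verify(state: PipelineState) -> dict:
    return {"verdicts": []}    # SUPPORTS / REFUTES / NOT ENOUGH INFO per claim

def report(state: PipelineState) -> dict:
    return {"report": ""}      # Markdown/JSON report with rationale


graph = StateGraph(PipelineState)
graph.add_node("extract", extract)
graph.add_node("search", search)
graph.add_node("verify", verify)
graph.add_node("report", report)
graph.add_edge(START, "extract")
graph.add_edge("extract", "search")
graph.add_edge("search", "verify")
graph.add_edge("verify", "report")
graph.add_edge("report", END)

app = graph.compile()
result = app.invoke({"input_text": "The Eiffel Tower is in Berlin."})
```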
All nodes use structured outputs (Pydantic schemas), so no “stringly-typed” prompt glue. Type safety throughout the graph.
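Concretely, "structured outputs" means a node can bind a Pydantic schema to the model so the verdict comes back as typed data instead of free text. A minimal sketch; the schema here is hypothetical, not the repo's actual one:

```python
from typing import Literal

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Verdict(BaseModel):
    """Hypothetical verification output; the repo's actual schema may differ."""
    label: Literal["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str


llm = ChatOpenAI(model="gpt-4o", temperature=0)
verifier = llm.with_structured_output(Verdict)

verdict = verifier.invoke(
    "Claim: The Eiffel Tower is in Berlin.\n"
    "Evidence: The Eiffel Tower is a wrought-iron lattice tower in Paris, France."
)
print(verdict.label, verdict.confidence)  # e.g. REFUTES 0.95
```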
Results (FEVER Dataset, 100 Samples)
Tested on a FEVER subset with GPT-4o:
| Configuration | Overall | SUPPORTS | REFUTES | NEI |
|---|---|---|---|---|
| Web Search | 71% | 88% | 82% | 42% |
| Wikipedia-only | 72% | 91% | 88% | 36% |
Context: Specialized FEVER systems hit 85-90%+. For a weekend LLM pipeline, 72% is decent, but NOT ENOUGH INFO is clearly broken.
Where It Breaks
- NOT ENOUGH INFO (42% accuracy): The model infers a verdict from partial evidence instead of abstaining. Teaching LLMs to say “I don’t know” is harder than getting them to support or refute.
- Evidence specificity: A claim says “founded by two men,” the evidence lists two names but never explicitly states “two,” and the verifier counts the names and declares SUPPORTS. That’s technically wrong under FEVER guidelines (a lightweight cross-check is sketched after this list).
- Contradiction edges: Temporal qualifiers (“as of 2019…”) and entity disambiguation (same name, different entity) still cause issues.
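For the evidence-specificity failure, one lightweight cross-check would be running an off-the-shelf NLI model over each claim/evidence pair and treating weak entailment as a signal to abstain. This is a sketch, not something in the current pipeline; the model choice and threshold are placeholders:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any MNLI-style checkpoint works; this one labels contradiction/neutral/entailment.
MODEL = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entailment_probs(evidence: str, claim: str) -> dict[str, float]:
    """Return label -> probability for the (evidence, claim) pair."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    return {model.config.id2label[i].lower(): probs[i].item() for i in range(len(probs))}


probs = entailment_probs(
    "The company was founded by Alice Smith and Bob Jones.",
    "The company was founded by two men.",
)
# If the LLM said SUPPORTS but the NLI model sees weak entailment,
# downgrade to NOT ENOUGH INFO rather than trusting the inferred count.
if probs["entailment"] < 0.5:  # placeholder threshold
    print("weak entailment -> abstain")
```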
Tech Stack
- Python with LangGraph
- Pydantic for structured outputs
- Tavily API for web search
- GPT-4o for reasoning
- FEVER dataset for evaluation
What’s Included
- Batch evaluation harness: Run hundreds of samples with parallel workers (the pattern is sketched after this list)
- Ablation configs: Test web vs Wikipedia-only modes
- Full documentation: Wiki covers setup, usage, architecture, and API reference
- Eval scripts: the `evals/` directory has reproducible runs
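The parallel-evaluation pattern is roughly the following sketch; `check_claim` is a hypothetical stand-in for the real pipeline entry point, and the actual scripts live in `evals/`:

```python
import json
from concurrent.futures import ThreadPoolExecutor


def check_claim(claim: str) -> str:
    """Hypothetical stand-in: run the pipeline on one claim, return its FEVER label."""
    raise NotImplementedError


def evaluate(samples: list[dict], workers: int = 8) -> float:
    """Run claims in parallel and compute label accuracy against gold labels."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        predictions = list(pool.map(lambda s: check_claim(s["claim"]), samples))
    correct = sum(pred == s["label"] for pred, s in zip(predictions, samples))
    return correct / len(samples)


if __name__ == "__main__":
    # Expects one JSON object per line with "claim" and "label" fields.
    with open("fever_subset.jsonl") as f:
        samples = [json.loads(line) for line in f]
    print(f"accuracy: {evaluate(samples):.1%}")
```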
Open Questions
I’m looking for feedback on:
- NEI Detection: What works for making abstention stick? Routing strategies? NLI filters? Confidence thresholding? (One sketch of thresholding follows this list.)
- Contradiction Handling: Lightweight approaches to catch “close but not entailed” evidence without heavy reranker stacks?
- Eval Design: What would make you trust this style of system? Harder subsets? Human-in-the-loop checks?
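As a concrete starting point on the thresholding idea: a post-hoc guard that downgrades low-confidence SUPPORTS/REFUTES calls to NOT ENOUGH INFO. A sketch reusing the hypothetical `Verdict` schema from the architecture section; the threshold is arbitrary and would need tuning:

```python
from typing import Literal

from pydantic import BaseModel


class Verdict(BaseModel):  # same hypothetical schema as in the earlier sketch
    label: Literal["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
    confidence: float
    rationale: str


ABSTAIN_THRESHOLD = 0.7  # placeholder; tune on a held-out FEVER split


def apply_abstention(verdict: Verdict) -> Verdict:
    """Force an explicit NOT ENOUGH INFO when the verifier isn't confident."""
    if verdict.label != "NOT ENOUGH INFO" and verdict.confidence < ABSTAIN_THRESHOLD:
        return verdict.model_copy(update={
            "label": "NOT ENOUGH INFO",
            "rationale": (
                f"Abstained: confidence {verdict.confidence:.2f} is below "
                f"{ABSTAIN_THRESHOLD}. Original rationale: {verdict.rationale}"
            ),
        })
    return verdict
```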
Links
- Repository: github.com/tsensei/GroundCrew
- Evaluation scripts: evals/
- Documentation: Wiki
MIT licensed. If you’re working on fact-checking or have ideas for NEI detection, I’d love to hear them.