What I Built
GroundCrew is an automated fact-checking pipeline. Give it any text, and it:
- Extracts factual claims
- Searches for evidence (web or Wikipedia)
- Verifies each claim (SUPPORTS / REFUTES / NOT ENOUGH INFO)
- Generates a report with confidence scores and evidence
Built with LangGraph, fully typed with Pydantic, runs in batch mode for evaluations.
Quick Start:
```bash
git clone https://github.com/tsensei/GroundCrew.git
cd GroundCrew
# Setup instructions in README
```
Architecture
4-Stage LangGraph Pipeline:
- Extraction: Pulls factual statements from input text
- Search: Tavily web search or Wikipedia mode
- Verification: Compares claim against evidence
- Reporting: Markdown/JSON output with rationale
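If you haven't used LangGraph, the wiring looks roughly like this. A minimal sketch with hypothetical state fields and stub nodes, not the repo's actual code:

```python
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class PipelineState(TypedDict):
    """Hypothetical shared state flowing through the four stages."""
    input_text: str
    claims: List[str]      # filled by extraction
    evidence: List[dict]   # filled by search
    verdicts: List[dict]   # filled by verification
    report: str            # filled by reporting


# Stub nodes; each returns a partial state update that LangGraph merges in.
def extract(state: PipelineState) -> dict:
    return {"claims": []}      # LLM call that pulls factual claims

def search(state: PipelineState) -> dict:
    return {"evidence": []}    # Tavily or Wikipedia lookup per claim

def verify(state: PipelineState) -> dict:
    return {"verdicts": []}    # SUPPORTS / REFUTES / NOT ENOUGH INFO per claim

def report(state: PipelineState) -> dict:
    return {"report": ""}      # Markdown/JSON report with rationale


graph = StateGraph(PipelineState)
graph.add_node("extract", extract)
graph.add_node("search", search)
graph.add_node("verify", verify)
graph.add_node("report", report)
graph.add_edge(START, "extract")
graph.add_edge("extract", "search")
graph.add_edge("search", "verify")
graph.add_edge("verify", "report")
graph.add_edge("report", END)

app = graph.compile()
result = app.invoke({"input_text": "The Eiffel Tower is in Berlin."})
```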
All nodes use structured outputs (Pydantic schemas), so no “stringly-typed” prompt glue. Type safety throughout the graph.
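Concretely, "structured outputs" means a node can bind a Pydantic schema to the model so the verdict comes back as typed data instead of free text. A minimal sketch; the schema here is hypothetical, not the repo's actual one:

```python
from typing import Literal

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Verdict(BaseModel):
    """Hypothetical verification output; the repo's actual schema may differ."""
    label: Literal["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
    confidence: float = Field(ge=0.0, le=1.0)
    rationale: str


llm = ChatOpenAI(model="gpt-4o", temperature=0)
verifier = llm.with_structured_output(Verdict)

verdict = verifier.invoke(
    "Claim: The Eiffel Tower is in Berlin.\n"
    "Evidence: The Eiffel Tower is a wrought-iron lattice tower in Paris, France."
)
print(verdict.label, verdict.confidence)  # e.g. REFUTES 0.95
```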
Results (FEVER Dataset, 100 Samples)
Tested on a FEVER subset with GPT-4o:
| Configuration | Overall | SUPPORTS | REFUTES | NEI |
|---|---|---|---|---|
| Web Search | 71% | 88% | 82% | 42% |
| Wikipedia-only | 72% | 91% | 88% | 36% |
Context: Specialized FEVER systems hit 85-90%+. For a weekend LLM pipeline, 72% is decent, but NOT ENOUGH INFO is clearly broken.
Where It Breaks
- NOT ENOUGH INFO (42% accuracy): The model infers a verdict from partial evidence instead of abstaining. Teaching LLMs to say “I don’t know” is harder than getting them to support or refute.
- Evidence specificity: A claim says “founded by two men,” the evidence lists two names but never explicitly states “two,” and the verifier counts the names and declares SUPPORTS. That’s technically wrong under FEVER guidelines (a lightweight cross-check is sketched after this list).
- Contradiction edges: Temporal qualifiers (“as of 2019…”) and entity disambiguation (same name, different entity) still cause issues.
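For the evidence-specificity failure, one lightweight cross-check would be running an off-the-shelf NLI model over each claim/evidence pair and treating weak entailment as a signal to abstain. This is a sketch, not something in the current pipeline; the model choice and threshold are placeholders:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any MNLI-style checkpoint works; this one labels contradiction/neutral/entailment.
MODEL = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entailment_probs(evidence: str, claim: str) -> dict[str, float]:
    """Return label -> probability for the (evidence, claim) pair."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze()
    return {model.config.id2label[i].lower(): probs[i].item() for i in range(len(probs))}


probs = entailment_probs(
    "The company was founded by Alice Smith and Bob Jones.",
    "The company was founded by two men.",
)
# If the LLM said SUPPORTS but the NLI model sees weak entailment,
# downgrade to NOT ENOUGH INFO rather than trusting the inferred count.
if probs["entailment"] < 0.5:  # placeholder threshold
    print("weak entailment -> abstain")
```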
Tech Stack
- Python with LangGraph
- Pydantic for structured outputs
- Tavily API for web search
- GPT-4o for reasoning
- FEVER dataset for evaluation
What’s Included
- Batch evaluation harness: Run hundreds of samples with parallel workers (the pattern is sketched after this list)
- Ablation configs: Test web vs Wikipedia-only modes
- Full documentation: Wiki covers setup, usage, architecture, and API reference
- Eval scripts: the `evals/` directory has reproducible runs
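The parallel-evaluation pattern is roughly the following sketch; `check_claim` is a hypothetical stand-in for the real pipeline entry point, and the actual scripts live in `evals/`:

```python
import json
from concurrent.futures import ThreadPoolExecutor


def check_claim(claim: str) -> str:
    """Hypothetical stand-in: run the pipeline on one claim, return its FEVER label."""
    raise NotImplementedError


def evaluate(samples: list[dict], workers: int = 8) -> float:
    """Run claims in parallel and compute label accuracy against gold labels."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        predictions = list(pool.map(lambda s: check_claim(s["claim"]), samples))
    correct = sum(pred == s["label"] for pred, s in zip(predictions, samples))
    return correct / len(samples)


if __name__ == "__main__":
    # Expects one JSON object per line with "claim" and "label" fields.
    with open("fever_subset.jsonl") as f:
        samples = [json.loads(line) for line in f]
    print(f"accuracy: {evaluate(samples):.1%}")
```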
Open Questions
I’m looking for feedback on:
- NEI Detection: What works for making abstention stick? Routing strategies? NLI filters? Confidence thresholding? (One sketch of thresholding follows this list.)
- Contradiction Handling: Lightweight approaches to catch “close but not entailed” evidence without heavy reranker stacks?
- Eval Design: What would make you trust this style of system? Harder subsets? Human-in-the-loop checks?
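As a concrete starting point on the thresholding idea: a post-hoc guard that downgrades low-confidence SUPPORTS/REFUTES calls to NOT ENOUGH INFO. A sketch reusing the hypothetical `Verdict` schema from the architecture section; the threshold is arbitrary and would need tuning:

```python
from typing import Literal

from pydantic import BaseModel


class Verdict(BaseModel):  # same hypothetical schema as in the earlier sketch
    label: Literal["SUPPORTS", "REFUTES", "NOT ENOUGH INFO"]
    confidence: float
    rationale: str


ABSTAIN_THRESHOLD = 0.7  # placeholder; tune on a held-out FEVER split


def apply_abstention(verdict: Verdict) -> Verdict:
    """Force an explicit NOT ENOUGH INFO when the verifier isn't confident."""
    if verdict.label != "NOT ENOUGH INFO" and verdict.confidence < ABSTAIN_THRESHOLD:
        return verdict.model_copy(update={
            "label": "NOT ENOUGH INFO",
            "rationale": (
                f"Abstained: confidence {verdict.confidence:.2f} is below "
                f"{ABSTAIN_THRESHOLD}. Original rationale: {verdict.rationale}"
            ),
        })
    return verdict
```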
Links
- Repository: github.com/tsensei/GroundCrew
- Evaluation scripts: evals/
- Documentation: Wiki
MIT licensed. If you’re working on fact-checking or have ideas for NEI detection, I’d love to hear them.