I run an AI-based fact-checking platform and I refuse to let the LLM produce the verdict. Here’s why.

After a year building a production fact-checking system, the single most counter-intuitive design decision I keep defending is this: the LLM in our pipeline never produces a numeric score, never produces a true/false verdict, never produces anything that gets surfaced to the user as a judgment. The LLM extracts structured factual flags from source material. A deterministic Python scoring layer turns those flags into a verdict tier. That’s it.

This is uncomfortable to explain because everyone, including potential customers, assumes that “AI-powered fact-checking” means the AI gives the verdict. The pitch would be cleaner if I let the LLM say “this claim is 73% likely false” and called it a day. But here’s why I won’t.

LLM scoring instability is real and underdocumented. Run the same prompt with the same model on the same claim five times and you get verdicts ranging from “mostly false” to “partially true” depending on sampling temperature and the order in which sources appear in the context window. This is fine for creative writing. It is catastrophic when a journalist needs to defend their decision to publish or kill a story. “Our scoring varies by 30% based on stochastic sampling” is not a sentence you can put in front of an editorial board.

LLM verdicts are also unauditable. When the LLM says “false,” there is no way to point at which sources mattered, which signals pushed the score, or which weights were applied. The reasoning chain is opaque even with chain-of-thought prompting, because the chain itself is generated probabilistically and may rationalize after the fact rather than reflect the actual computation. Journalists I’ve spoken with don’t want a confident AI verdict. They want a verifiable verdict.

Those are different things.

The split I landed on is this. The LLM is good at extraction. Given a source document and a claim, it can flag “this source confirms X,” “this source contradicts Y,” “this source is silent on Z” with reasonable consistency. These flags are structured (booleans or short categorical labels), not numeric scores. The Python scoring layer takes those flags, applies pre-defined weights based on source credibility (independently computed from MBFC, NewsGuard, RSF, Wikidata cross-referencing), and produces a verdict tier. The weights are documented. The scoring rules are deterministic. The same input always produces the same output. Anyone can audit which sources contributed how much to a given verdict.
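
For a sense of what that scoring layer looks like, here's a minimal sketch. The flag schema, weight values, and tier thresholds below are made up for illustration, not our production numbers; the point is only that the verdict is a pure function of structured flags and documented weights, so the same flags always yield the same tier and every source's contribution is inspectable.

```python
from dataclasses import dataclass
from enum import Enum

class Stance(Enum):
    CONFIRMS = "confirms"
    CONTRADICTS = "contradicts"
    SILENT = "silent"

@dataclass(frozen=True)
class SourceFlag:
    """The structured output the LLM is allowed to produce: one flag per source."""
    source_id: str
    stance: Stance
    credibility: float  # 0.0-1.0, computed upstream from the credibility pipeline

# Hypothetical verdict tiers and thresholds -- a design choice, documented and versioned.
TIERS = [
    (0.6, "supported"),
    (0.2, "leaning true"),
    (-0.2, "unverified"),
    (-0.6, "leaning false"),
]

def score(flags: list[SourceFlag]) -> tuple[str, list[tuple[str, float]]]:
    """Deterministic scoring: same flags in, same verdict out, with a per-source audit trail."""
    contributions = []
    total_weight = 0.0
    for f in sorted(flags, key=lambda f: f.source_id):  # sorted so source order never matters
        direction = {Stance.CONFIRMS: 1.0, Stance.CONTRADICTS: -1.0, Stance.SILENT: 0.0}[f.stance]
        contributions.append((f.source_id, direction * f.credibility))
        if f.stance is not Stance.SILENT:
            total_weight += f.credibility

    if total_weight == 0.0:
        return "insufficient evidence", contributions

    net = sum(c for _, c in contributions) / total_weight  # normalised to [-1, 1]
    for threshold, tier in TIERS:
        if net >= threshold:
            return tier, contributions
    return "contradicted", contributions

# Example (hypothetical sources and credibility values):
flags = [
    SourceFlag("reuters-2024-03-01", Stance.CONFIRMS, 0.9),
    SourceFlag("blog-xyz", Stance.CONTRADICTS, 0.3),
    SourceFlag("gov-report-17", Stance.SILENT, 0.8),
]
verdict, audit = score(flags)  # verdict == "leaning true"; audit shows each source's contribution
```

Nothing in that function is stochastic, and the audit list is exactly the artifact you hand to an editor who asks “why did it say that?”
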

The trade-off is real. The system is less flexible than letting the LLM “reason” freely. Edge cases where the claim doesn’t fit the categorical extraction schema sometimes produce awkward outputs. The scoring weights themselves are a design choice that embeds assumptions, and changing them requires deliberate engineering rather than retraining. But these are honest constraints, visible to the user, rather than hidden non-determinism dressed up as objectivity.

I think this matters beyond fact-checking. Any high-stakes domain where AI is being used to produce decisions (credit scoring, hiring filters, medical triage, legal triage) faces the same fundamental choice: let the LLM produce the score and hope nobody notices the stochasticity, or constrain the LLM to extraction and put the decision logic somewhere auditable. The industry mostly does the first thing because it ships faster. I think the second approach is the only one defensible long-term, especially under the EU AI Act, which is going to start requiring decision explainability in production systems within the next 18 months.

Curious if anyone here is building similar deterministic-on-top-of-LLM architectures in other domains, or if there are counter-arguments I’m missing. The “let the LLM decide” school has obvious advantages I’m probably under-weighting.

submitted by /u/jonathancheckwise