How Statistical Quality Control, Not Prompt Engineering, Reduced My AI Accuracy Problems

In the rush to integrate Large Language Models (LLMs) into cybersecurity, we’ve hit a wall: Noise. Most AI-driven scanners either miss real issues or flood developers with “hallucinated” findings that look plausible but have no basis in reality.
After some code review and scanning sessions with just a system prompt that went horribly wrong, I decided I needed a new approach. Out of those failures came Veritas (GitHub), a multi-agent security pipeline: a proof-of-concept project to assist developers in manual code reviews by pointing them at the needles in the haystack. I didn’t approach it from a “prompt engineering” perspective; I approached it as a product security engineer with an M.S. in Industrial Engineering Management (a statistical quality and process control background).
The path that seemed to make Veritas more reliable wasn’t finding the “perfect” prompt, using forceful language, or being pedantic with walls of text. It was treating the LLM like an unreliable machine on an assembly line and building a quality-control cage around it.
In process-control terms, I stopped treating each model response as a finished product and started treating it as an intermediate output that needed inspection, disposition, and escalation rules. The goal wasn’t to eliminate variation in the model, but to design a process that could tolerate variation in response quality and still produce reviewable outputs.
The Philosophy of Designed Distrust
Most people treat LLMs as oracles… they see all, they know all. In Veritas, I treat them as stochastic (probability-based) machines with an expected error rate of unknown magnitude. I don’t expect the AI to be “right” all the time; in fact, I designed the system with the expectation that it will be wrong. One AI agent is asked to “prove it”, while another AI agent is asked to “disprove it” with as little bias as possible.
This is the “Yin and Yang” of the design: a deliberate tug-of-war between two opposing forces.
1. The Yang: Expansion and Recall
The first stage of our pipeline is the hypothesis agent. Its goal is recall. It acts as a wide-angle lens, reasoning about the architecture’s entry points, trust boundaries, and threat models to surface every plausible candidate. At this stage, the cost of generating a hypothesis that later gets refuted is intentionally kept low; compared to the cost of missing a real finding, which can be much higher, that trade-off is economical.
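To make that concrete, here is a minimal sketch of what a recall-oriented generation call could look like. The prompt wording, the call_model placeholder, and the assumed JSON output shape are illustrative only; they are not the actual Veritas prompts or schema.

```python
import json
from typing import Callable, Dict, List

# Placeholder: any function that sends a prompt to an LLM and returns raw text.
CallModel = Callable[[str], str]

HYPOTHESIS_SYSTEM_PROMPT = (
    "You are a threat-modeling assistant. From the architectural summary, "
    "enumerate every PLAUSIBLE vulnerability candidate. Favor recall over "
    "precision: candidates that are later refuted are acceptable."
)

def generate_hypotheses(call_model: CallModel, threat_context: str) -> List[Dict]:
    """Recall stage: over-generate candidates from abstracted threat context."""
    raw = call_model(f"{HYPOTHESIS_SYSTEM_PROMPT}\n\nContext:\n{threat_context}")
    # Assumes the model is asked to reply with a JSON list of candidates,
    # e.g. {"id", "title", "file", "severity", "reasoning", "confidence"}.
    return json.loads(raw)
```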
2. The Yin: Contraction and Precision
The second stage is the evidence agent. Its goal is precision. It’s a “skeptical auditor” mandated to actively refute findings unless they are supported by specific, cited source-code evidence. It operates against the actual file content, checking for sanitizers or validation logic that the hypothesis might have missed.
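The counterpart stage, sketched under the same assumptions (a placeholder prompt and an assumed JSON verdict shape, not the real ones):

```python
import json
from typing import Callable, Dict

EVIDENCE_SYSTEM_PROMPT = (
    "You are a skeptical code auditor. Your default verdict is REFUTED. "
    "Confirm a finding only if you can cite the exact lines in the provided "
    "file content that make it exploitable, after checking for sanitizers "
    "and validation logic the hypothesis may have missed."
)

def verify_hypothesis(call_model: Callable[[str], str],
                      hypothesis: Dict, file_content: str) -> Dict:
    """Precision stage: try to refute; confirm only with cited code evidence."""
    prompt = (f"{EVIDENCE_SYSTEM_PROMPT}\n\nFinding:\n{json.dumps(hypothesis)}"
              f"\n\nFile content:\n{file_content}")
    # Assumes a JSON reply like {"status": "confirmed" | "refuted" |
    # "inconclusive", "confidence": 0.0-1.0, "cited_evidence": "..."}.
    return json.loads(call_model(prompt))
```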
The Secret Sauce: The Information Bottleneck
The most counter-intuitive part of this design is that we intentionally make the second agent “dumber” to make the overall system smarter.
In my pipeline, I use a slimming function called slim_hypotheses_for_evidence(). Before a hypothesis reaches the evidence agent, we strip away the "why" (the previous agent's reasoning).

Why delete data? To defeat anchoring bias. If an LLM sees a plausible-sounding reason why a bug exists, it can anchor on that explanation and overfit its verification to the hypothesis it was handed. By stripping the reasoning between the stages, we force the evidence agent into a lower-bias verification state. It must find the exploit path itself using the provided code context, or the finding is discarded.
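I won’t reproduce the real slim_hypotheses_for_evidence() here, but assuming each hypothesis is a dictionary with a “reasoning” field, a plausible shape is simply an allow-list of neutral, factual fields:

```python
from typing import Dict, List

# Assumed field names; the actual Veritas schema may differ.
KEEP_FIELDS = ("id", "title", "file", "severity")

def slim_hypotheses_for_evidence(hypotheses: List[Dict]) -> List[Dict]:
    """Strip the upstream agent's "why" before the evidence agent sees it."""
    # Drop the persuasive narrative ("reasoning") and the generator's own
    # confidence claim, so the verifier cannot anchor on either.
    return [{k: h[k] for k in KEEP_FIELDS if k in h} for h in hypotheses]
```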
The Deterministic Anchor: The Policy Gate
The AI never has the last word. The final verdict is issued by the equivalent of a mechanical policy gate.
This gate doesn’t “reason”; it applies a deterministic checklist (sketched after the list below):
- Confidence Rubrics: Findings with confidence scores below 0.3 are automatically marked “Inconclusive.”
- Fail-Safe Handling: Any critical or high severity finding that remains “inconclusive” is flagged as NEEDS_HUMAN rather than being silently dropped.
- Reconciliation: The gate audits the entire pipeline to ensure every initial “pre-scan” finding has been either confirmed, disproven, or flagged for a human expert.
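A minimal sketch of what such a gate can look like; the threshold matches the rubric above, but the field names and the reconcile helper are illustrative assumptions:

```python
from typing import Dict, List

CONFIDENCE_FLOOR = 0.3            # below this, automatically "inconclusive"
HIGH_SEVERITIES = {"critical", "high"}

def policy_gate(finding: Dict) -> str:
    """Deterministic disposition: a checklist, not another model call."""
    status = finding.get("status", "inconclusive")
    if finding.get("confidence", 0.0) < CONFIDENCE_FLOOR:
        status = "inconclusive"
    # Fail-safe: critical/high findings are never silently dropped.
    if status == "inconclusive" and finding.get("severity") in HIGH_SEVERITIES:
        return "NEEDS_HUMAN"
    return status

def reconcile(prescan_ids: List[str], dispositions: Dict[str, str]) -> List[str]:
    """Audit: every pre-scan finding must end up confirmed, refuted, or escalated."""
    return [fid for fid in prescan_ids if fid not in dispositions]
```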
An Assistant, Not a Replacement
Veritas is not intended to replace SAST tools like CodeQL or traditional static analysis. It is a targeted code review assistant POC for developers and Security Champions who take part in large manual code security reviews.
By using “adversarial verification”, token count is traded for output quality. The system burns some extra “capital” (compute and tokens) on the assembly line to ensure that when a developer finally looks at a report, they aren’t chasing hallucinations… they’re reviewing findings tied to cited code evidence.
Accuracy was not only a function of model selection, size, or number of parameters. In this design, it became the byproduct of architectural tension, a deliberate expansion phase followed by a skeptical contraction phase. In the gap between the “expansion” of a hypothesis and the “contraction” of evidence, I found a process I can actually trust. The next step in the project is empirical calibration: measuring these confidence scores and outcomes against more known vulnerable and non-vulnerable cases.
Technical Appendix: Separating Generation from Judgment
I’m including some of the technical constraints and design principles here, so that anyone actually interested in the assembly-line conveyor-belt analogy, and in how those stages were designed, has some insight:
Core Problem
Most vulnerability scanners fail in one of two directions: they either miss real issues or they flood you with noise. This project deliberately splits the scanning process into two purpose-built stages:
- Hypothesis: one optimized to identify as many plausible candidates as possible
- Evidence: one optimized to confirm only what’s supported by code evidence provided
This technical section explains (at a very high level) how splitting the scanning process into two distinct agent-driven LLM phases reduces the structural tension that often drives false negatives and false positives in single-pass scanners.
Keep in mind I’m also using Type 1 and Type 2 errors in the practical code scanning sense: “Type 1” as a false positive, and “Type 2” as a false negative.
Veritas’ Core Method of Operation
The separation does one fundamental thing: it decouples generation from judgment. In a single-stage scanner, the same reasoning pass that notices a suspicious pattern must simultaneously decide whether it is a real vulnerability. That dual purpose in a single pass creates pressure to err in both directions at once. Splitting the work into two distinct agents, each with a narrow task order and different inputs, breaks that contradiction.
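Reusing the placeholder helpers sketched earlier (generate_hypotheses, slim_hypotheses_for_evidence, verify_hypothesis, policy_gate), the decoupling composes roughly like this; the files mapping from path to content is another assumption:

```python
from typing import Callable, Dict, List

def run_pipeline(call_model: Callable[[str], str],
                 threat_context: str, files: Dict[str, str]) -> List[Dict]:
    # Generation: sees abstracted threat context, issues no verdicts.
    hypotheses = generate_hypotheses(call_model, threat_context)
    results = []
    # Information bottleneck: the verifier never sees the generator's reasoning.
    for h in slim_hypotheses_for_evidence(hypotheses):
        # Judgment: sees raw file content, adds no new candidates.
        verdict = verify_hypothesis(call_model, h, files.get(h.get("file", ""), ""))
        # Disposition: deterministic checklist, no model involved.
        results.append({**h, **verdict, "disposition": policy_gate({**h, **verdict})})
    return results
```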
Reducing Type 2 Errors (False Negatives aka missed findings)
- The hypothesis agent is structurally free to over-generate (it is built to create plausible finding candidates)
- The hypothesis agent works from abstracted threat context, not raw code
- The pre-scan feeds directly into hypothesis generation as candidates, not facts (see the sketch after this list)
- The evidence agent has a hard rule against silent omission of pre-scan findings
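As a small illustration of the “candidates, not facts” rule: pre-scan results can be merged into the hypothesis list as unverified candidates, tagged with their source so the later reconciliation step can detect silent omissions. Field names here are, again, assumptions.

```python
from typing import Dict, List

def merge_prescan(hypotheses: List[Dict], prescan_findings: List[Dict]) -> List[Dict]:
    """Pre-scan results enter the pipeline as candidates to verify, never as facts."""
    seeded = [{**f, "source": "prescan", "status": "candidate"}
              for f in prescan_findings]
    return hypotheses + seeded
```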
Reducing Type 1 Errors (False Positives aka unsupported findings)
- The evidence agent has an explicit mandate to refute
- The evidence agent operates against the provided file content, not just abstractions
- The confidence rubric is adjusted asymmetrically
- Slimming (selective redaction and artifact retention) reduces confirmation bias from the evidence agent’s input
- The file_not_in_context rule prevents invented confirmations (sketched below)
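And a rough guess at the shape of that last rule: if the file a “confirmed” finding points at was never actually provided to the evidence agent, the confirmation is downgraded rather than trusted.

```python
from typing import Dict, Set

def apply_file_not_in_context(finding: Dict, files_in_context: Set[str]) -> Dict:
    """A confirmation only counts if the cited file was actually in context."""
    if finding.get("status") == "confirmed" and finding.get("file") not in files_in_context:
        # The agent claims a confirmation for code it never saw:
        # downgrade it instead of letting an invented citation through.
        return {**finding, "status": "inconclusive", "note": "file_not_in_context"}
    return finding
```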
The Combined Effect
The separation produces a pipeline where the hypothesis stage is recall-maximizing (biased against false negatives) and the evidence stage is precision-maximizing (biased against false positives), with each stage’s bias deliberately counteracting the failure mode of the other. Neither stage is trying to do both jobs simultaneously, which reduces the internal conflict that can make single-stage approaches unstable: conservative enough to miss subtle issues, yet still permissive enough to produce unsupported claims.
Put differently, the pipeline separates the operating points. The hypothesis stage intentionally runs at a high-recall, low-precision operating point. The evidence stage then runs at a higher-precision operating point using stricter evidence requirements. Instead of forcing one model call or chat session that sits on a single compromise point on the precision/recall curve, the system composes two stages with different objectives and preserves all of the evidence/artifacts for auditability.