Before You Tune Your Judge, Tune Your Rubric

Why most LLM-as-judge reliability work starts in the wrong place.

You wired up an LLM judge three months ago. Scores were noisy. You raised K from 1 to 3, then to 5. You tightened temperature to 0. You went from a 50-example test set to 200. You tried a stronger model. You added chain-of-thought. Your scores are still noisy on the dimensions that matter most.

This is the common trajectory, and most of those moves are aimed in the wrong direction. The dominant source of unreliable LLM judge scores is not the model and not the sampling strategy. It is the rubric itself.

Two kinds of variance look identical and require opposite treatments

When an LLM judge returns different scores on identical inputs, one of two things is happening.

Stochastic variance. The judge holds a single interpretation of the rubric. Token-level sampling noise produces slightly different scores across passes. Averaging more passes reduces it. Logprob-weighted expected value reduces it. Temperature=0 reduces it.

Interpretive variance. The judge holds more than one internally coherent reading of a rubric dimension across passes. On one pass “financial narrative coherence” is about whether the output matches the stated goals. On the next pass it is about whether the numbers reconcile with source data. Both readings are defensible. The average is a meaningless composite of different questions.

The two look identical from the outside. You see a spread of scores and a standard deviation. Nothing in the output tells you which kind of variance you have. The treatments are opposite. Averaging cannot reduce interpretive variance, because you are not averaging noise on a measurement; you are averaging across measurements of different things.

Most teams treat all observed variance as stochastic and scale the treatment that works for stochastic variance: more passes, logprobs, ensembles. When the variance is interpretive, every dollar spent there is wasted. Worse, the averaged composite looks more stable than it is, which produces false confidence in the score.

The Nguyen 2025 paper formalizes this using the interpretation set of a specification. A rubric with one valid interpretation has |I(D)| = 1. A rubric with more than one has |I(D)| > 1. Tightening the substrate is how you collapse the interpretation set. Averaging is not.

How much of your variance is specification, not sampling?

The Grundl 2026 study is the cleanest evidence available. Grundl ran over 1,200 LLM judge reviews across a three-task design that progressively constrained the task: Task 1 was open-ended, Task 2 specified the research design, Task 3 additionally provided cleaned data.

Between Task 1 and Task 2, score standard deviations collapsed across every frontier model tested: 1.0 to 0.8 for GPT-5.4, 2.3 to 1.3 for GPT-5.3-Codex, 2.7 to 1.8 for Opus 4.6. Between Task 2 and Task 3, the drop was approximately zero. Tighter specification reduced variance by 30–50%. Cleaner inputs on top of a tight spec did nothing.

The implication is uncomfortable for most production pipelines. Tightening the rubric reduces variance. Tightening the inputs to the judge does not, once the spec is tight. Teams that invest in better evaluator data pipelines before they invest in better rubric specifications are spending on the wrong axis.

What a tight rubric actually contains

Five patterns, ordered by impact. None are esoteric. Most are missing from the production rubrics we see.

Per-point anchors, not just the endpoints

A 1-to-5 rubric that defines only “1 = fails” and “5 = excellent” leaves the judge to extrapolate the middle. Models extrapolate toward the mean. Human rating corpora skew toward 3 stars, and RLHF-trained models inherit the prior. You get central-tendency bias, and it does not cancel across passes.

The fix is a one-sentence prose anchor for every point on the scale, with at least one concrete example per anchor. Grundl’s reference rubric uses a five-band 1–100 scale where each band is labeled (91–100 “highly credible,” 76–90 “strong with limitations,” and so on) and scorers are told to place the output inside a band, not between bands. This is the pattern to copy.

A useful test: if you cannot write distinct prose for every point on your scale, you have too many points. Cut the scale until the anchors are distinct.
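A minimal sketch of what banded anchors look like as data, loosely following the five-band 1–100 structure described above. The band edges, the labels below 76, the prose anchors, and the examples are illustrative placeholders, not the Grundl paper's actual rubric.

```python
# Illustrative banded rubric: every band gets a label, a prose anchor, and an example.
# Band edges and wording below 76 are placeholders, not the reference rubric's text.
CREDIBILITY_BANDS = [
    {"range": (91, 100), "label": "highly credible",
     "anchor": "Claims fully reconcile with source data; no unsupported statements.",
     "example": "Every figure in the summary traces to a cited source field."},
    {"range": (76, 90), "label": "strong with limitations",
     "anchor": "Core claims reconcile; minor gaps that do not change conclusions.",
     "example": "One derived ratio lacks an explicit source but is recomputable."},
    {"range": (51, 75), "label": "mixed",
     "anchor": "Some claims reconcile, others cannot be traced to source data.",
     "example": "Two of five key figures have no identifiable source."},
    {"range": (26, 50), "label": "weak",
     "anchor": "Most claims cannot be traced; conclusions rest on unsupported figures.",
     "example": "The headline number appears nowhere in the source notes."},
    {"range": (1, 25), "label": "not credible",
     "anchor": "Claims contradict source data or fabricate figures outright.",
     "example": "Reported growth is positive while the source shows a decline."},
]

def render_anchor_section(bands: list[dict]) -> str:
    """Render the bands into prompt text so the judge places the output inside a band."""
    lines = ["Score by placing the output inside exactly one band:"]
    for b in bands:
        lo, hi = b["range"]
        lines.append(f"{lo}-{hi} ({b['label']}): {b['anchor']} Example: {b['example']}")
    return "\n".join(lines)
```

Keeping the anchors as data and rendering them into the prompt means an edit to an anchor cannot drift out of sync with the judge instructions.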

Evidence anchoring

The highest-leverage single change we have measured. Every claim the judge makes about the output must quote the specific text it is referring to and cite the specific source material that supports it. “The output references a goal that is not in the source notes” is a scorable claim only when the judge has quoted the output sentence and cited the source field. Ungrounded claims score zero on the relevant dimension.

Two things happen when you force this. The judge’s reasoning becomes auditable: a human reviewer can check any score against a specific quote. The judge also stops making claims it cannot support, because fabricated reasoning becomes expensive to produce inside the prompt structure.

The pattern is strongest on dimensions where the judge is evaluating against source data: correctness, fidelity, accuracy, groundedness. It is also useful on softer dimensions as a forcing function on specificity.
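One way to make this enforceable rather than aspirational is to require quote and citation fields in the judge's output schema and score only the claims that carry both. A sketch, assuming a Pydantic-style schema; the field names are illustrative, not a fixed standard.

```python
from pydantic import BaseModel

class EvidencedClaim(BaseModel):
    # Each claim must carry the output text it refers to and the source it checked against.
    statement: str        # what the judge asserts about the output
    output_quote: str     # verbatim text from the output being scored
    source_citation: str  # field or passage in the source material, e.g. "notes.goal_2"
    supports_score: int   # this claim's contribution to the dimension score

def grounded_score(claims: list[EvidencedClaim]) -> int:
    """Count only claims that carry both a quote and a citation; ungrounded claims score zero."""
    return sum(
        c.supports_score
        for c in claims
        if c.output_quote.strip() and c.source_citation.strip()
    )
```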

One criterion per prompt

Multi-criteria rubrics in a single judge call produce correlated scores across dimensions. The judge reads the output, forms a gestalt impression, and projects that impression onto each score. This is the halo effect, and it is mechanical in autoregressive models: every later score token is conditioned on every earlier score token in the output.

The fix is one judge call per rubric dimension. The LLM-as-a-Verifier paper formalizes this as criteria decomposition (the C axis) and averages across separate calls. Operationally, decomposition costs more in API calls and buys back enough score independence to be worth it. On dimensions most susceptible to halo (tone, style, impression-based scoring) the effect size is large.
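A sketch of the decomposition, assuming a judge_call callable that wraps whatever LLM client you use and scores one dimension per request; the K=3 default is illustrative.

```python
import statistics
from typing import Callable

def score_all_dimensions(
    rubrics: dict[str, str],              # dimension name -> single-criterion rubric text
    output_text: str,
    judge_call: Callable[[str, str], float],  # (rubric_text, output_text) -> score; your LLM call
    k: int = 3,
) -> dict[str, float]:
    """One judge call per dimension per pass, so no dimension's score token is
    conditioned on another dimension's score within the same generation."""
    scores: dict[str, float] = {}
    for dim, rubric in rubrics.items():
        passes = [judge_call(rubric, output_text) for _ in range(k)]
        scores[dim] = statistics.mean(passes)
    return scores
```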

Score first, then explain

Autoregressive generation means the score token is conditioned on every token that preceded it. If the judge reasons before scoring, the score is sampled from a distribution that includes the explanation the model just wrote. The explanation biases the score, systematically, not occasionally.

Run the same rubric with score-first and with explain-first ordering and you get different score distributions on identical inputs. The difference is not noise. It is a prompt-structure artifact.

Always put the score first in the output schema. Let the explanation follow as justification of an already-committed score. The alternative is two API calls, one for the score and one for the explanation, which breaks the coupling entirely at twice the cost. Score-first in a single call captures most of the benefit for free.
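As a sketch, the ordering is just field order in the structured-output schema. Whether your provider's structured-output mode preserves property order is an assumption to verify; if it does not, state the ordering explicitly in the prompt.

```python
# Score-first output schema: the score token is emitted before any explanation tokens,
# so the committed score is not conditioned on the explanation.
SCORE_FIRST_SCHEMA = {
    "type": "object",
    "properties": {
        "score": {"type": "integer", "minimum": 1, "maximum": 5},
        "explanation": {"type": "string"},
    },
    "required": ["score", "explanation"],
    # If your structured-output mode does not preserve property order,
    # add "emit score before explanation" to the prompt itself.
}
```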

Binary or few-band scales beat 1–10

LLMs are not calibrated to produce consistent scores on arbitrary numeric ranges. A 7 on one run is a 9 on the next, not because the output changed but because there is no stable mapping from quality to the digit 7 versus 8 versus 9. The practical guidance from Evidently, LangChain, and Hamel Husain’s widely-cited guide converges: binary pass/fail or three to five named bands outperforms 1–10 scales on human-alignment metrics in most settings.

A useful test: can a human applying your rubric consistently distinguish a 6 from a 7? If not, the scale is finer than the signal supports. Collapse it.

One caveat. On dimensions where you rank or compare outputs, finer granularity helps. The LLM-as-a-Verifier paper uses a 20-point scale and shows monotonic improvement in ranking accuracy with granularity, but only when logprob-weighted expected value is available. Without logprobs, the high-granularity scale is false precision. Default to coarse bands unless you have the logprob machinery to justify going finer.
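For reference, the logprob-weighted expected value is a probability-weighted mean over the candidate score tokens at the score position. A sketch, assuming your provider returns top-logprobs for that position:

```python
import math

def expected_score(score_token_logprobs: dict[str, float]) -> float:
    """Probability-weighted mean over candidate score tokens at the score position.

    `score_token_logprobs` maps a candidate score token (e.g. "1".."20") to its log
    probability; non-numeric tokens are ignored and the remaining mass is renormalized.
    """
    weighted, total = 0.0, 0.0
    for token, logprob in score_token_logprobs.items():
        if not token.strip().isdigit():
            continue
        p = math.exp(logprob)
        weighted += int(token) * p
        total += p
    if total == 0.0:
        raise ValueError("No numeric score tokens in the logprob set")
    return weighted / total

# Example: mass split 60/40 between 7 and 8 yields an expectation of 7.4.
# expected_score({"7": math.log(0.6), "8": math.log(0.4)})
```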

Three patterns that leak

Instructions practitioners reach for that quietly fail.

“Do not reward X” instructions leak. Grundl’s reviewer prompts explicitly told the judges not to reward sign, magnitude, or significance of the research findings. They did anyway. Submissions with positive preferred estimates scored approximately 3 points higher on a 0–100 scale after controlling for reviewer, task, and source. Negative instructions do not enforce. They reduce an effect at the margin and leave a systematic bias intact.

The fix is to move enforcement out of the prompt and into code. If a passing composite requires compliance to hold, check compliance after the judge returns and gate the composite in the pipeline, not in the judge’s instructions. The judge outputs a score. The pipeline decides what the score means.

Filtering in the prompt. Asking the judge to “ignore data older than 90 days” is a negative instruction dressed up as a rule. It leaks the same way. Filter the inputs before they reach the prompt. Old data never arrives at the judge.
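A sketch of filtering before the prompt rather than inside it, assuming each record carries a timezone-aware ISO-8601 timestamp in an as_of field; the field name and the 90-day window are illustrative.

```python
from datetime import datetime, timedelta, timezone

def filter_recent(records: list[dict], max_age_days: int = 90) -> list[dict]:
    """Drop stale records before they reach the judge prompt; the judge never sees them.

    Assumes each record's "as_of" value is a timezone-aware ISO-8601 timestamp.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [r for r in records if datetime.fromisoformat(r["as_of"]) >= cutoff]
```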

Compliance gating inside the judge. Telling the judge “if any compliance rule is violated, score the whole output as failing regardless of other dimensions” produces two failures. First, the compliance dimension contaminates the other dimensions through the autoregressive coupling described earlier. Second, the judge sometimes scores on general impression and silently ignores the gate. Pull the gate into code. The judge scores compliance as one dimension. The pipeline decides whether the composite passes.
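A sketch of the code-level gate, with illustrative dimension names and thresholds: the judge returns per-dimension scores, compliance among them, and this function, not the judge, decides whether the composite passes.

```python
def composite_passes(scores: dict[str, float],
                     compliance_dim: str = "compliance",
                     compliance_floor: float = 1.0,
                     composite_floor: float = 0.7) -> bool:
    """Gate in the pipeline, not in the judge prompt.

    The judge scores compliance like any other dimension. This function decides that
    a compliance violation fails the composite regardless of the other scores.
    Thresholds and the dimension name are illustrative.
    """
    if scores.get(compliance_dim, 0.0) < compliance_floor:
        return False
    non_compliance = [v for k, v in scores.items() if k != compliance_dim]
    composite = sum(non_compliance) / len(non_compliance)
    return composite >= composite_floor
```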

How to know your rubric is working

Two diagnostics and one validation gate.

Per-dimension standard deviation is the substrate health signal. When we run a rubric multiple times on the same output, we log SD per dimension, not just the composite. High SD on a single dimension means interpretive variance on that dimension, which means that dimension needs specification work. Averaging more passes is the wrong treatment. The fix is upstream: clearer anchors, evidence anchoring, or further criterion decomposition.
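A sketch of the diagnostic, assuming you already collect per-dimension scores from K passes over the same output:

```python
import statistics

def per_dimension_sd(passes: list[dict[str, float]]) -> dict[str, float]:
    """Standard deviation per rubric dimension across K passes on the same output.

    A dimension whose SD stays high under repeated passes gets flagged for
    specification work (clearer anchors, evidence anchoring, further decomposition),
    not for more averaging.
    """
    dims = passes[0].keys()
    return {
        dim: statistics.stdev(p[dim] for p in passes)
        for dim in dims
    }
```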

Cross-model reviewer agreement is a diagnostic, not a validation. If three different judge models produce the same ranking, that is useful information about reproducibility across model choice. It is not evidence that the ranking is correct. Shared training data produces shared biases. Agreement is consistent with “all judges right” and equally consistent with “all judges wrong in the same direction.” The Grundl paper made exactly this framing error at the validation level: the cross-reviewer agreement presented as evidence of trustworthy rankings is the same agreement that masked the positive-estimate bonus all four reviewers leaked.

Cohen’s Kappa against human labels is the validation gate. You need humans in the loop somewhere, even 20 labeled examples in a narrow domain. Measure Kappa between the judge and the humans on a held-out set the judge prompts never saw during tuning. Kappa measures categorical agreement corrected for chance, which is what matters. Correlation alone does not; the NVIDIA 2025 paper on 54 LLM judges showed correlation above 0.9 can coexist with Kappa below 0.3 when the judges and humans disagree on where the category boundaries sit. Ship the rubric when Kappa crosses a threshold grounded in the cost of disagreement. Do not ship on cross-model agreement alone.
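A sketch of the gate using scikit-learn's implementation, assuming judge and human labels are already mapped to the same categorical bands; the 0.6 threshold is a placeholder, not a recommendation.

```python
from sklearn.metrics import cohen_kappa_score

def kappa_gate(judge_labels: list[str], human_labels: list[str], threshold: float = 0.6) -> bool:
    """Validation gate: ship the rubric only if judge/human agreement clears the threshold.

    The threshold is illustrative; set it from the cost of disagreement in your domain.
    Labels must be categorical bands (e.g. "pass"/"fail" or band names), not raw numeric scores.
    """
    kappa = cohen_kappa_score(judge_labels, human_labels)
    return kappa >= threshold
```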

What this buys you

A rubric built this way changes the economics of the pipeline. Specification effort is a one-time cost per rubric. Averaging, logprobs, ensemble models, and bigger judges are recurring costs per call. When the spec is tight, K=3 averaging is enough for most dimensions, logprobs become useful rather than necessary, and the judge model often drops a tier with no measurable quality loss.

The pattern we see in production: two to three weeks of focused rubric work per evaluator pays back in lower recurring inference cost and in scores more stable than teams get when they ship a loose rubric and try to stabilize downstream. Tightening the substrate is the high-leverage move. Do that first.

The author works on AI at Advisor360°, where rubric-based LLM evaluation gates production pipelines that meet advisors and compliance officers on the other end.

