LLM as an (Opinionated) Judge

Image made by the author using ChatGPT

The evaluation problem

Building systems around large language models has become standard practice. Use cases span a wide spectrum, from narrow, well-defined tasks like classifying spam emails to open-ended ones like generating plain-language summaries of complex legal contracts or even surfacing the key issues buried within them. On the implementation side, the barriers have dropped considerably. The harder problem is one of evaluation: how do you actually know if the output is correct? And why, given how capable these systems have become, is that question still so difficult to answer?

The first complication is that LLMs are stochastic by nature; the same input can produce different outputs across runs, which muddies the picture even for simple classification tasks. We can reduce this by setting temperature to 0, nudging the model toward deterministic behavior, but that comes at a cost: LLMs genuinely benefit from some looseness in how they sample, and forcing determinism tends to hurt quality.

Another common approach is black-box evaluation: run the same inference multiple times and draw conclusions from the aggregate, the same way we’ve long treated other complex classifiers like deep neural networks. But now we’ve opened a different can of worms: is 2-out-of-3 correct classifications good enough? Should we do this in production too, always running multiple requests? How do we aggregate the results?
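A minimal sketch of that aggregation, assuming a hypothetical `classify` callable that wraps the model call (not a real API):

```python
from collections import Counter

def aggregate_runs(classify, text, n_runs=3):
    """Run the same classification n_runs times and majority-vote the labels.

    `classify` is any callable taking the input and returning a label;
    in practice it would wrap an LLM call (hypothetical, not a real API).
    """
    labels = [classify(text) for _ in range(n_runs)]
    winner, count = Counter(labels).most_common(1)[0]
    # `agreement` surfaces the 2-out-of-3 question instead of hiding it
    return {"label": winner, "agreement": count / n_runs, "runs": labels}
```

The `agreement` field keeps the open question visible: the caller, not the aggregator, decides whether 2/3 is good enough for the use case at hand.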

Things get harder when the output isn’t a classification label but free text. A common workaround is to define a checklist of key elements a correct answer should contain and score based on coverage. But this breaks down fast. The same idea can be expressed many different ways. A text can tick every box and still be wrong. Two outputs can both be “correct” while one is clearly better than the other. And none of this helps when we’re trying to evaluate something like tone, clarity, or trust; qualities that don’t reduce to a checklist at all.
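A deliberately naive sketch of such a checklist scorer makes the limitation concrete: substring matching ticks boxes without understanding, so a wrong answer that mentions every item still scores well.

```python
def checklist_coverage(output: str, checklist: list[str]) -> float:
    """Fraction of required key phrases present in the output
    (case-insensitive substring match).

    Naive by design: it misses paraphrases of the same idea and can be
    satisfied by an answer that is flatly wrong but name-drops every item.
    """
    text = output.lower()
    hits = sum(1 for item in checklist if item.lower() in text)
    return hits / len(checklist) if checklist else 0.0
```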

So how do we evaluate automatically, at scale, without reading every output by hand? That’s where LLM-as-a-judge comes in.

LLM-as-a-Judge

LLM-as-a-judge was formally introduced by Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (NeurIPS 2023, UC Berkeley). The paper tackled a concrete problem: as chat-based LLMs improved rapidly, the field had no scalable way to evaluate open-ended response quality. Human evaluation is the gold standard but it’s slow, expensive, and hard to reproduce consistently. The authors proposed using a strong LLM, specifically GPT-4, as an automated judge, and validated this against human ratings at scale using MT-Bench (a multi-turn reasoning benchmark) and Chatbot Arena (a live human preference dataset). Their key finding was that GPT-4 judgments aligned with human preferences at a rate comparable to the agreement between the human raters themselves. That was enough to establish LLM-as-a-judge as a legitimate evaluation methodology, not just a convenient shortcut.

The core idea is simple: instead of checking an output against a rigid reference, ask an LLM to evaluate it. This is particularly useful for exactly the grey areas that rule-based approaches struggle with: is this answer factually grounded? Is it clearly phrased? Is it better than the alternative?

In practice LLM-as-a-judge comes in a few flavors:

  • Pairwise comparison: give the judge two outputs and ask which is better. Useful for preference evaluation and model comparison.
  • Single-answer scoring: ask the judge to rate one output across specific dimensions like accuracy, clarity, or conciseness.
  • Reference-grounded fact-checking: give the judge source material and ask it to verify whether the output is faithful to it.
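As a rough illustration, the three flavors map onto three prompt shapes. The wording below is an assumption on my part, not a validated rubric:

```python
# Illustrative judge-prompt templates for the three flavors.
PAIRWISE = (
    "You are an impartial judge. Given a question and two answers, "
    "reply with exactly 'A' or 'B' for the better answer.\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
)

SINGLE_SCORE = (
    "Rate the answer on accuracy, clarity, and conciseness, each 1-5.\n"
    "Question: {question}\nAnswer: {answer}"
)

REFERENCE_GROUNDED = (
    "Using ONLY the source material below, say whether the answer is "
    "faithful to it. Reply 'faithful' or 'unfaithful' with a one-line reason.\n"
    "Source: {source}\nAnswer: {answer}"
)
```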

Each flavor has its natural use cases, and as we’ll see, each carries its own failure modes. While the technique is genuinely useful, it’s far from plug-and-play, and this is what we’ll tackle today.

Failure point 1: Every judge brings a perspective

A widely cited best practice is to use a different LLM as the judge than the one that generated the output. The intuition makes sense; a different model is less likely to repeat the same mistakes, and cross-model agreement feels like a stronger signal than self-validation. But this assumes the two models are roughly neutral relative to each other on the task. In practice, that assumption breaks down more than you’d expect.

Here’s a concrete example. I was building an investment analysis pipeline: technical analysis first, then a lightweight LLM for early filtering, then a stronger model for the final verdict. Claude handled the second stage; Gemini Pro came in as the judge. The setup looked principled: using one of the best available models to verify another. What actually happened was striking: Gemini Pro consistently flagged Claude’s investment suggestions as too risky and not worth pursuing, ultimately dismissing almost all of Claude’s recommendations.

So I ran a different test: I kept the same initial pipeline structure but split the second-stage role, running it once with Claude and once with Gemini Pro, and compared their actual returns. Claude’s suggestions outperformed. That made clear the problem wasn’t a broken judge. The problem was that the judge had a different disposition: Gemini Pro seems to lean conservative on risk, Claude more aggressive. Neither is wrong in the abstract. But using one to evaluate the other on an inherently opinionated task isn’t cross-validation. It’s a clash of priors dressed up as quality control.

This points to something we tend to overlook: we compare LLMs on general benchmarks, looking for proxies of overall capability. What those benchmarks don’t capture are the softer behavioral tendencies each model carries; its default stance on risk, its tolerance for ambiguity, its implicit preferences when making judgment calls. These don’t show up in leaderboard scores. They surface the moment you put one model in the position of judging another.

The implication isn’t that cross-model evaluation is wrong; it’s that it needs to be scoped carefully to the task. For objective, verifiable questions (“does this list of terms appear in this document?”), model disposition barely matters. But for opinionated tasks (“is this investment sound? Is phrasing A clearer than phrasing B?”), we need to understand each model’s baseline stance before trusting it as a neutral arbiter. An important note to remember: a judge with a strong prior isn’t judging your output. It’s overwriting it.

Failure point 2: The judge inherits the crime scene

This one is subtler, and in some ways more dangerous, because it looks like rigor while undermining it. It happens when we give the judge LLM the full context (the original input, the first model’s output, and its reasoning) and ask it to verify the conclusion. The intent is thorough evaluation. What actually happens is that we’ve turned the judge into a reader of the first model’s argument, not an independent assessor of the facts.

We ran into this while building a system to classify code snippets as malicious or benign. The pipeline used one LLM to classify and produce an explanation, and a second LLM to verify it. The verifying model got everything: the original code, the verdict, and the first model’s reasoning. That seemed complete; the judge had all the information.

The problem showed up in adversarial cases. When the first model was confidently wrong (say, correctly identifying a snippet as malicious but for a completely hallucinated reason: “this code is dangerous because it uses console.log, and console.log is inherently malicious”), the judge consistently failed to catch it. A fabricated but coherent explanation was enough to anchor its assessment. The judge LLM followed the reasoning rather than interrogating the conclusion.

The root cause is something fundamental about how LLMs process input: a language model assigns weight to every token it receives. There’s no built-in mechanism for it to discount a section of the prompt as “potentially unreliable”. When you include a fluent, confident explanation from another model, the judge treats that explanation as evidence. The more convincing the hallucination, the more confidently the judge validates it. At that point you’re not checking if the answer is correct; you’re checking if it’s internally consistent. Which is a very different thing.

A related version of this surfaces with scoring. If the first model outputs a confidence score, say “8/10”, and the judge receives it, the judge will anchor on it. Even if the two models use different internal scales, the judge will try to reconcile the numbers rather than ignore the provided value. The scale gets imported without verification.

The fix is conceptually simple: strip the first model’s reasoning before passing context to the judge, or at minimum run an evaluation pass where the judge sees only the original input and the final verdict, not the path taken to get there. If the conclusion holds up against the raw evidence without the supporting argument, it’s on firmer ground. The hard part is resisting the temptation to give the judge everything. More context feels like more rigor. Sometimes the most honest evaluation is a deliberately constrained one.
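A minimal sketch of that constrained evaluation pass. The `build_judge_prompt` helper is hypothetical; the point is what it deliberately withholds:

```python
def build_judge_prompt(original_input: str, verdict: str) -> str:
    """Judge prompt that withholds the generator's reasoning and any
    confidence score. The judge must re-derive support for the verdict
    from the raw evidence alone, instead of grading the argument."""
    return (
        "You are verifying a classification independently.\n"
        f"Input:\n{original_input}\n\n"
        f"Proposed verdict: {verdict}\n\n"
        "Based only on the input above, is this verdict correct? "
        "Answer 'yes' or 'no' and justify from the input itself."
    )
```

Note what is absent: no explanation, no score. The constraint is the feature.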

Failure point 3: LLMs have personalities, and they don’t disclose them

Here’s something that doesn’t get said enough: LLMs are not neutral instruments. Every model that reaches deployment has been shaped by its training data, its fine-tuning, and the values its creators chose to reinforce. Those choices leave fingerprints; not just on politically sensitive questions, but on subtle, task-specific behaviors that can quietly skew your evaluation pipeline without ever triggering an obvious error.

The obvious version of this is easy to demonstrate. Ask DeepSeek about Tiananmen Square and you’ll get a deflection. That’s a known, documented constraint, and most practitioners account for it. The harder cases are the ones nobody documented anywhere, where the model’s behavior reflects a latent disposition rather than an explicit policy.

We hit this during our malicious code detection work. Running the same classification task across multiple LLMs, we noticed the models didn’t agree on what “malicious” meant in practice. Some models routinely flagged snippets that others considered clearly benign. The disagreement wasn’t random noise; it was systematic and consistent per model. Each had effectively internalized a different threshold for what crosses the line, and none of them disclosed that threshold anywhere in its output.

This matters for LLM-as-a-judge because it means that when you ask a model to evaluate another model’s output, you’re not getting an objective reading. You’re getting that model’s interpretation, filtered through priors it can’t fully articulate and wouldn’t necessarily flag even if it could. Two models can receive the same verdict and the same evidence and reach opposite conclusions, not because one is smarter, but because they’ve defined the evaluation criteria differently at a level below the surface of the prompt.

The takeaway isn’t to stop using LLMs as judges. It’s to stop treating them as blank slates. Before deploying an LLM judge on a task, invest time in characterizing its baseline behavior: how does it interpret the key terms in your domain? Where does it land on the relevant spectrums: strict vs. permissive, risk-averse vs. risk-tolerant, conservative vs. liberal in its interpretation of ambiguous cases? That characterization won’t be perfect. But it turns the judge from an unknown quantity into a known one with understood tendencies, which is a much safer foundation.

Don’t naively run the same prompt across different LLMs and assume the results will be comparable. Model personality isn’t a bug to be patched, it’s a structural property to be mapped.

Failure point 4: Right answer, wrong reasons

The failure modes so far were about the judge getting things wrong. This one is different; it’s about the generator appearing to get things right, for reasons that are entirely fabricated.

During our malicious code detection research, we observed a pattern that was rare but systematic: a model would produce the correct verdict, malicious or benign, accompanied by a completely hallucinated justification. Not vague reasoning, but confidently stated, structurally coherent explanations with no grounding in the actual code. “This snippet is malicious because it generates birds on the screen.” “This code is safe because it runs with a randomized filename.” The verdict landed correctly; the path taken to get there was invented.

What makes this especially relevant for LLM-as-a-judge is how poorly standard evaluation handles it. If your judge is just checking the verdict (was the classification correct?), it will miss this entirely. The model looks right. But if that model is part of a pipeline where downstream steps rely on the reasoning, or where the explanation gets surfaced to an end-user, the hallucinated justification is a real problem regardless of whether the label happened to be correct.

There’s a deeper issue here too. A model that reaches correct conclusions via fabricated reasoning isn’t reliable; it got lucky. It has no consistent mechanism connecting its output to the underlying reality, which means the next time it faces a similar input, the verdict could easily flip while the explanation remains equally confident and equally wrong. And even if the model “knows” something it simply can’t articulate, that’s not a comfort; if you can’t verify the path, you can’t tell genuine intuition from a coin flip.

This should inform our design: verdict accuracy and reasoning quality need to be evaluated separately, not bundled into a single pass. A model with good intuition and bad analysis will pass verdict-only evaluation comfortably and fail the moment the reasoning gets scrutinized.

This is also one of the few cases where LLM-as-a-judge, applied carefully, can actually help: ask a judge to evaluate only the reasoning (“given this code, does this explanation make sense?”) without surfacing the verdict. That isolates exactly the failure modes we’re trying to catch. The verdict and the reasoning become two independent checkpoints rather than one bundled output that either passes or fails together.
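A sketch of the two checkpoints, assuming a hypothetical `ask_judge` callable that wraps the judge LLM and returns “yes” or “no”:

```python
def evaluate(code, verdict, explanation, expected_verdict, ask_judge):
    """Two independent checkpoints: verdict accuracy is checked against a
    known label, while reasoning quality is judged WITHOUT revealing the
    verdict, so a fabricated explanation can't ride on a correct label.

    `ask_judge` is a hypothetical callable wrapping the judge LLM; it
    takes a prompt and returns 'yes' or 'no'.
    """
    verdict_ok = verdict == expected_verdict
    reasoning_prompt = (
        f"Code:\n{code}\n\n"
        f"Proposed explanation: {explanation}\n\n"
        "Does this explanation accurately describe what the code does? "
        "Answer 'yes' or 'no'."
    )
    reasoning_ok = ask_judge(reasoning_prompt).strip().lower() == "yes"
    return {"verdict_ok": verdict_ok, "reasoning_ok": reasoning_ok}
```

A right-for-the-wrong-reasons output now fails visibly on the second checkpoint while still passing the first.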

Best practices and anti-patterns

The failure modes above aren’t reasons to avoid LLM-as-a-judge. They’re a map of where it requires deliberate choices rather than defaults. Here’s what that looks like in practice.

Match the judge to the task type first

The most important question to ask before setting up any evaluation is: is this task objective or opinionated? For objective questions (“does this term appear in the text? Is this value present in the output?”), model disposition barely matters, and LLM-as-a-judge works reliably. For opinionated questions (“is this phrasing clearer? Is this investment sound?”), we’ll need to understand the judge’s baseline stance before trusting its verdicts. Skipping this step is the root cause of most LLM-as-a-judge failures I’ve seen in practice.

Characterize your judge before you deploy it

Don’t assume the judge is neutral until you’ve checked. Run it against cases you’ve already evaluated manually, and look for where it systematically diverges. If it diverges consistently in one direction, that’s not noise; that’s personality. Either factor it in or pick a different judge for that task.
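One way to make this concrete, a sketch assuming you have (human label, judge label) pairs from a manually evaluated set:

```python
from collections import Counter

def characterize_judge(cases):
    """cases: list of (human_label, judge_label) pairs, e.g. over
    'malicious'/'benign'. Returns the agreement rate plus the direction
    of every systematic divergence; a lopsided divergence count is the
    judge's 'personality', not noise."""
    agree = sum(1 for human, judge in cases if human == judge)
    divergence = Counter((human, judge) for human, judge in cases if human != judge)
    return {
        "agreement": agree / len(cases),
        # e.g. {('benign', 'malicious'): 7} reads as a strict judge
        "divergence": dict(divergence),
    }
```

If the divergence counter is dominated by one direction, you’ve mapped a tendency; either factor it in or swap the judge.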

Control what context the judge receives

Giving the judge the full context (original input, first model output, and its reasoning) is one of the most common anti-patterns. It feels thorough but turns the judge into a reader of the first model’s argument rather than an independent evaluator. Instead:

  • Pass the original input and the final verdict only. Let the judge assess whether the conclusion holds against the raw evidence, without being walked through the reasoning that produced it.
  • If reasoning quality is what you’re evaluating, pass the reasoning without the verdict, so the judge evaluates the logic on its own merits rather than anchoring on the conclusion.
  • Never pass scores or confidence values from the first model to the judge. Numerical anchors import the first model’s scale into the evaluation without any verification.

Evaluate verdict and reasoning independently

Don’t bundle them. A model can be right for the wrong reasons. If the explanation reaches users or downstream systems, reasoning quality matters as much as verdict accuracy. Treat them as two separate checkpoints.

Use cross-model evaluation as triangulation, not arbitration

Running multiple models on the same task and comparing outputs is a legitimate way to build confidence, a form of stacked generalization borrowed from classical ML. But it works as a signal, not a decision rule. When two models agree on an objective question, that agreement is meaningful. When they disagree on an opinionated one, a majority vote doesn’t resolve the disagreement; it just hides it. Treat divergence as a flag for human review, not a tie to be broken automatically.
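A minimal sketch of divergence-as-a-flag rather than as a tie to break:

```python
def triangulate(model_outputs):
    """model_outputs: dict of model name -> label.

    Unanimous agreement is treated as a confidence signal; any
    disagreement is routed to human review instead of being
    majority-voted away."""
    labels = set(model_outputs.values())
    if len(labels) == 1:
        return {"status": "agreed", "label": labels.pop()}
    return {"status": "needs_human_review", "outputs": model_outputs}
```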

One exception: code

When using LLMs to write or review code, having multiple models critique each other is genuinely productive. Code is unique in that it has inherent ground truth: it runs or it doesn’t, it’s secure or it isn’t. In that context, the opinionated tendencies that make cross-model evaluation unreliable for subjective tasks become useful. Different models catch different classes of errors, and their disagreements point to real ambiguities worth investigating rather than just differences in taste. Where our usual modus operandi was majority vote, here we’re more interested in an “any model raised a flag” mode, covering as many aspects of the code as possible.
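A sketch of that “any model raised a flag” gate, assuming each model’s review has been reduced to a list of raised issues:

```python
def code_review_gate(model_flags):
    """model_flags: dict of model name -> list of issues it raised.

    For code we escalate if ANY model raises a flag, rather than
    requiring a majority: each model catches different error classes,
    so union coverage beats consensus."""
    flagged = {model: issues for model, issues in model_flags.items() if issues}
    return {"pass": not flagged, "flags": flagged}
```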

Image made by the author using ChatGPT

Conclusion: Use it, but don’t trust it blindly

LLM-as-a-judge is one of the most useful tools we have for evaluating language model outputs at scale. It fills a genuine gap; the space between rigid rule-based checks and expensive human review, and when applied thoughtfully, it delivers real signal. The goal here wasn’t to discourage its use, but to push back against the version that gets deployed naively, where a second model is pointed at the first model’s output and the result is simply treated as ground truth.

The failures we’ve covered share a common thread: they all come from forgetting that the judge is itself a language model. It has hidden dispositions. It anchors on whatever context it receives. It can be misled by confident, fluent reasoning just as easily as a human reader. It interprets your task through a lens shaped by its training, not through the neutral, objective frame we tend to assume when we reach for it.

The version of LLM-as-a-judge that works is a constrained, deliberate one. Know your task type before choosing your judge. Characterize the judge’s behavior on your domain before trusting its verdicts. Control what context it receives, and keep verdict evaluation separate from reasoning evaluation. Treat cross-model agreement as a confidence signal, not proof of correctness.

None of this is complicated in hindsight. But it requires treating evaluation as a first-class part of the system; not something you bolt on after the main pipeline is built. LLM-as-a-judge, done right, is a powerful part of that. Just not a shortcut out of it.


LLM as an (Opinionated) Judge was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
