Online Evals Done Right: Runtime Scoring and Review Queues for Production LLM Systems

A practical guide to online evals that score live traffic, apply LLM-as-judge checks, route risky cases to review, and feed production failures back into offline tests.

Article 3 in a series on eval loops for production LLM systems, with a companion reference implementation in llm-eval-ops. Article 1 introduced the two-loop model. Article 2 covered offline eval gates before release. This article focuses on how to implement the online side in a way teams can actually ship.

Most teams do not get stuck on online evals because they lack a platform. They get stuck because they ask the platform question too early.

Should we use Arize? Langfuse? Build our own tracing layer? Buy a full observability product now?

Those are reasonable questions, but they are usually not the first ones. The first questions are simpler and more important:

  • What should the system check on every live response?
  • Which failures can be handled automatically?
  • Which slice should go to an LLM judge?
  • When should a human actually get involved?
  • How do confirmed production failures become future offline tests?

If a team cannot answer those questions, no framework will fix the problem. It will just give them a nicer place to store uncertainty.

That is the real implementation problem. The hardest part of online evals is not choosing tooling. It is designing the control loop.

The strongest implementation pattern I have found follows a specific order:

  1. Run deterministic inline checks on every response
  2. Assign a risk band
  3. Send the uncertain slice to LLM-as-judge
  4. Route only the unresolved, high-value cases to human review
  5. Promote confirmed failures back into offline evals

That ordering matters. Human review is expensive. LLM judge calls cost money and add noise if used carelessly. Deterministic checks are fast, cheap, and consistent. Teams that get this order right usually make progress quickly. Teams that start with dashboards, frameworks, or broad human review workflows often create a lot of motion and very little control.

Concept diagram showing an online evaluation loop for a production LLM system: live traffic enters a live system, passes through runtime checks, LLM judge, and human review, then confirmed failures loop back into offline evals.
Online evals should automate routine checks, escalate only the uncertain slice, and turn confirmed failures into future offline tests. Source: Image by the author.

Start with one production decision

Do not begin with “we need online evals.” Begin with one concrete runtime decision you want the system to support.

Examples:

  • Should this response pass, degrade, or escalate?
  • When is a refusal actually wrong?
  • What counts as a high-risk answer?
  • What should trigger review now versus later?

Online eval is not just monitoring. It is a runtime decision system.

A lot of production LLM stacks already log prompts, completions, latency, cost, and user feedback. Useful, but incomplete. A response can come back quickly, stay within budget, and still be behaviorally wrong. It can refuse when the evidence is already present. It can cite documents the retriever never returned. It can produce valid JSON and still make the wrong decision.

That is why the online loop needs to be designed around behavior, not just telemetry.

The point of a control plane is not to show more numbers. It is to answer, quickly, whether the system is healthy, where risk is entering the funnel, and whether confirmed failures are being turned into future protections.

The online control plane summarizes the evaluation funnel from inline checks to judge review to human correction to offline learning. Source: Screenshot by the author from the llm-eval-ops reference implementation.

Step 1: Add deterministic inline checks first

The first implementation milestone is not human review. It is not LLM-as-judge either. It is deterministic checking on every response.

These are runtime guards on live traffic, not benchmark scoring for release decisions. Their job is to catch obvious failure modes quickly and consistently while the request is still in flight.

This is the layer teams should build first because it gives the most coverage for the least cost. In practice, this means checking things like:

  • Structured output validity
  • Citation integrity
  • Retrieval support
  • Policy adherence
  • Obvious blocker conditions

These are not abstract quality checks. They are concrete runtime signals.

A response that cites a source the retriever never returned is a different class of failure from a response with weak retrieval support. A response that breaks format rules is different again. A response that refuses even though the evidence is present is not a safe abstention. It is a wrong refusal.

A good inline layer does two jobs at once. It scores runtime behavior and emits specific failure signals the rest of the system can use.

Inline checks score every live response on deterministic runtime dimensions such as groundedness, citation validity, retrieval support, and format validity. Source: Screenshot by the author from the llm-eval-ops reference implementation.

A small set of trusted dimensions such as groundedness, citation validity, retrieval support, and format validity, each reduced to a binary runtime signal, is enough:

  • Schema valid or not
  • Citations grounded or not
  • Retrieval support above threshold or not
  • Key policy rule passed or failed

That is already enough to create useful risk signals.
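To make that concrete, here is a minimal sketch of what an inline check layer can look like in Python. The function names, the CheckResult shape, the signal codes, and the 0.6 support threshold are illustrative assumptions, not the llm-eval-ops API:

```python
import json
from dataclasses import dataclass, field

# A minimal sketch of a deterministic inline check layer.
# Names, signal codes, and the 0.6 threshold are illustrative
# assumptions, not the llm-eval-ops API.

@dataclass
class CheckResult:
    passed: bool
    signals: list[str] = field(default_factory=list)  # e.g. blocker codes

def check_schema(response_text: str) -> CheckResult:
    """Structured output validity: does the response parse as JSON?"""
    try:
        json.loads(response_text)
        return CheckResult(passed=True)
    except (ValueError, TypeError):
        return CheckResult(passed=False, signals=["invalid_json"])

def check_citations(cited_ids: set[str], retrieved_ids: set[str]) -> CheckResult:
    """Citation integrity: every cited source must appear in retrieval."""
    if cited_ids - retrieved_ids:
        return CheckResult(passed=False, signals=["citations_not_in_retrieval"])
    return CheckResult(passed=True)

def check_retrieval_support(top_score: float, threshold: float = 0.6) -> CheckResult:
    """Retrieval support: best-chunk relevance above a fixed threshold."""
    if top_score < threshold:
        return CheckResult(passed=False, signals=["weak_retrieval_support"])
    return CheckResult(passed=True)
```

Each check is cheap, deterministic, and emits a named signal rather than a bare score, which is what the routing layer in the next step consumes.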

Step 2: Add a small risk policy

Once you have deterministic checks, add a small routing policy. Keep it simple.

A three-band model is often enough:

  • Low risk: pass
  • Medium risk: candidate for semantic review
  • High risk: escalate or hold for review

The point is not mathematical elegance. The point is operational clarity.

A low-risk response might have valid structure, grounded citations, and strong retrieval support. A medium-risk response might be formally valid but carry weaker support or one advisory signal. A high-risk response might have a blocker like citations_not_in_retrieval, unsupported_claim_signal, or a likely wrong refusal despite strong retrieval.

If every questionable response goes to a human, the system will not scale. If nothing gets escalated, the score is not useful. The risk policy is what turns online eval into an actual control loop.

Do not start with ten risk classes, dynamic routing logic, and dozens of thresholds. Start with a simple policy that the team can explain in one minute.
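As a sketch, the whole policy can fit in a few lines. The blocker codes citations_not_in_retrieval and unsupported_claim_signal come from the example above; the remaining codes are illustrative placeholders:

```python
# A minimal three-band risk policy over inline check signals.
# citations_not_in_retrieval and unsupported_claim_signal come from
# the article's example; the other codes are illustrative.

BLOCKERS = {"citations_not_in_retrieval", "unsupported_claim_signal", "invalid_json"}
ADVISORIES = {"weak_retrieval_support", "borderline_policy_match"}

def risk_band(signals: set[str]) -> str:
    """Map inline check signals to a risk band."""
    if signals & BLOCKERS:
        return "high"    # escalate or hold for review
    if signals & ADVISORIES:
        return "medium"  # candidate for semantic (judge) review
    return "low"         # pass
```

A policy this small passes the one-minute-explanation test, and every threshold it encodes is visible in one place.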

Step 3: Use LLM-as-judge for the uncertain slice

Once deterministic checks are in place, add the judge layer. Not as the first evaluator or source of ground truth. As the second layer.

The judge is useful when the important question is not “did the schema parse?” but “did the system make the right semantic decision?”

That includes:

  • Was the answer actually supported?
  • Was the refusal justified?
  • Was the response mode correct?
  • Did the assistant behave in line with policy, not just format?

A practical rollout looks like this:

  • Always run judge on medium-risk traffic
  • Run judge on high-risk traffic where semantic confirmation helps
  • Sample a small portion of low-risk traffic to keep visibility on baseline quality

That gives teams signal without forcing judge cost onto every response.
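A sketch of that routing rule, with an assumed 2 percent low-risk sample rate:

```python
import random

# Sketch of a judge routing rule implementing the rollout above.
# The 2% low-risk sample rate is an illustrative default, not a
# recommendation from the reference implementation.
LOW_RISK_SAMPLE_RATE = 0.02

def should_run_judge(band: str, needs_semantic_confirmation: bool = True) -> bool:
    if band == "medium":
        return True                         # always judge the uncertain slice
    if band == "high":
        return needs_semantic_confirmation  # judge where confirmation adds signal
    return random.random() < LOW_RISK_SAMPLE_RATE  # baseline quality visibility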

LLM Judge dashboard view showing judge coverage, completed and failed records, judge escalations, and a results-by-slice table grouped by risk band with recent judge records below.
LLM-as-judge reviews the uncertain slice, tracks coverage and escalations, and helps decide which cases need human attention. Source: Screenshot by the author from the llm-eval-ops reference implementation.

This is where platforms like Arize or Langfuse can eventually help, but they are still not the first problem to solve. First decide what the judge is checking, when it runs, and what a judge escalation means. Then evaluate tooling based on whether it makes that workflow easier to run, inspect, and improve.

Step 4: Bring humans in only where automation falls short

A strong online eval workflow does not start with human review. It ends there.

By the time a case reaches a reviewer, something meaningful should have happened already:

  • A deterministic blocker fired
  • The judge found a semantic issue
  • The case is novel or important enough that automation should not settle it alone

That makes human review scarce, high-value work instead of routine cleanup.

A reviewer should not just click “bad response.” The review interface should capture a reusable correction artifact that the rest of the system can learn from: what the system returned, what it should have returned, and whether the case should be promoted into offline evals.

Here is a concrete example from the reference implementation. The assistant refuses to answer a refund-policy question even though retrieval has already surfaced the relevant policy chunk. The inline and judge layers identify this as a likely wrong refusal. The reviewer then corrects the expected outcome to supported_answer, writes the proper response text, and marks the case for offline export.
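A minimal sketch of what that correction artifact might look like as a data structure. The field names and the literal values below are illustrative; only the supported_answer outcome and the offline-export flag come from the example above:

```python
from dataclasses import dataclass

# Sketch of a reviewer correction artifact for the wrong-refusal
# example. Field names are illustrative assumptions; only the
# supported_answer outcome comes from the reference implementation.

@dataclass
class ReviewerCorrection:
    record_id: str
    original_response: str      # what the system returned
    corrected_outcome: str      # e.g. "supported_answer"
    corrected_response: str     # what it should have returned
    reviewer_notes: str
    promote_to_offline: bool    # export into offline evals?

correction = ReviewerCorrection(
    record_id="rec_refund_001",  # hypothetical id
    original_response="I can't help with refund policy questions.",
    corrected_outcome="supported_answer",
    corrected_response="Per the retrieved policy, refunds are available...",  # placeholder text
    reviewer_notes="Wrong refusal: the relevant policy chunk was retrieved.",
    promote_to_offline=True,
)
```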

Reviewer Correction screen showing the system’s original response as a wrong refusal, reviewer notes, corrected expected outcome set to supported_answer, corrected response text, and a checked option to add the case to offline eval export.
Human review is reserved for cases where automation still leaves uncertainty. The reviewer corrects the expected outcome and can promote the case into offline eval export. Source: Screenshot by the author from the llm-eval-ops reference implementation.

That is a much stronger workflow than collecting thumbs-down events and hoping someone reads them later.

Step 5: Close the loop into offline evals

This is the part teams skip most often, and it is the most important one.

A production failure is only partially useful if it stays in the review queue. It becomes strategically useful when it is exported into a new offline eval case so the same mistake can be caught before the next release.

That is the loop.

In the reference implementation, corrected review items can be exported into a portable JSON artifact with fields such as expected behavior, reference response text, risk tier, tags, and metadata. That exported case can then be refined and merged into the offline dataset.
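As a sketch, serializing that correction into a portable case is straightforward. The schema below is inferred from the fields named above and reuses the ReviewerCorrection sketch from Step 4; it is not the exact export format of the reference implementation:

```python
import json

# Sketch: turn a confirmed correction into a portable offline eval
# case. The schema is inferred from the fields named in the text,
# not the exact llm-eval-ops export format. ReviewerCorrection is
# the sketch from Step 4.

def export_offline_case(correction: ReviewerCorrection,
                        question: str,
                        risk_tier: str = "high") -> str:
    case = {
        "question": question,
        "expected_behavior": "answer",  # vs. "refuse"
        "reference_response": correction.corrected_response,
        "risk_tier": risk_tier,
        "tags": ["wrong_refusal", "refund_policy"],  # illustrative tags
        "metadata": {"source": "human_review", "record_id": correction.record_id},
    }
    return json.dumps(case, indent=2)
```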

JSON file showing an exported offline eval case for a wrong refusal, including question, expected behavior set to answer, reference response text, tags, risk tier, and metadata copied from human review.
Confirmed production failures are exported into portable offline eval cases so the same issue can be caught before the next release. Source: Screenshot by the author from the llm-eval-ops reference implementation.

Without the loop back into offline tests, review is mostly triage. The system catches failures, labels them, and waits for them to happen again. With the loop, every important production miss has a path to becoming a future release guardrail.

That is the compounding effect teams should care about.

Where frameworks fit, and where they do not

Teams should be careful not to turn the tooling question into the starting point.

You do not need to choose Arize, Langfuse, or any other eval platform on day one to implement this. A lot of teams can get surprisingly far with:

  • Application logs
  • A few runtime scoring functions
  • A review table
  • A judge workflow
  • A simple operator UI
  • An export path into offline evals

That is enough to prove the loop.
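To show how little glue the loop actually needs, here is a sketch that wires the earlier pieces together, reusing risk_band and should_run_judge from the sketches above. call_judge and enqueue_for_review are hypothetical stubs standing in for a real judge call and a review table:

```python
# A sketch of the glue, reusing risk_band and should_run_judge from
# the earlier sketches. call_judge and enqueue_for_review are
# hypothetical stubs, not real APIs.

def call_judge(record: dict) -> str:
    return "pass"  # placeholder: a real judge returns pass / fail / escalate

def enqueue_for_review(record: dict) -> None:
    print(f"queued for review: {record['id']}")  # placeholder review table

def handle_response(record: dict) -> str:
    """Decide pass / degrade / escalate for one live response."""
    signals = set(record.get("signals", []))  # emitted by inline checks (Step 1)
    band = risk_band(signals)                 # Step 2
    if should_run_judge(band) and call_judge(record) == "escalate":
        enqueue_for_review(record)            # Step 4: scarce human attention
        return "escalate"
    if band == "high":
        return "degrade"                      # e.g. fall back to a safe template
    return "pass"
```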

Once the loop exists, the framework question becomes much easier. At that point, you know what you need to trace, which signals matter, what operators actually look at, where your bottlenecks are, and whether the pain is storage, dashboards, collaboration, or annotation.

That is when platforms become accelerators instead of substitutes for thinking.

The advice here is simple: design the control loop first, then choose the tooling that helps run it well.

A practical rollout plan for teams

One practical rollout sequence looks like this:

  • Week 1: Define one runtime decision and implement 3 to 5 deterministic checks
  • Week 2: Add risk bands and basic operator visibility
  • Week 3: Add judge review for medium-risk traffic and a small low-risk sample
  • Week 4: Add reviewer correction and offline export for confirmed cases
  • Week 5 and beyond: Tune thresholds, measure noise, and only then evaluate platforms or deeper instrumentation

The exact calendar will vary by team, but the sequence matters more than the dates. This path is usually faster than choosing a platform first and trying to invent a workflow around its UI.

Common mistakes

Starting with the dashboard. A dashboard is not an eval loop unless it changes routing. If scores do not affect what happens to the response, they are metrics, not guardrails.

Starting with human review. That creates expensive, noisy workflows before deterministic checks have had a chance to filter. Most responses do not need a human. The point is to make sure the ones that do are worth the reviewer’s time.

Starting with the platform decision. Tools can help a good control loop run more smoothly. They do not define one. Teams that adopt a platform before understanding their own eval signals end up shaping their workflow around the tool’s assumptions instead of their own requirements.

Stopping at review. This is the subtlest and most damaging mistake. If confirmed failures never come back into offline evals, the system stays reactive. Every release cycle re-exposes the same failure categories because no regression test was ever created from the production signal that caught them.

Closing

Offline evals decide what should ship. Online evals decide what should be trusted live. But the strongest systems do not stop there. They connect the two.

Implement deterministic inline checks first. Add a small risk policy. Use LLM-as-judge for the uncertain slice. Bring humans in only where automation still falls short. Then turn confirmed production failures into offline tests.

That is the implementation order that gives teams real control without forcing them to overcommit to tooling too early.

The companion repo, llm-eval-ops, implements the patterns described in this series.

