Structured Output for LLMs in Production: From json.loads() to Validated Objects

In the era of GenAI hype, every team has ideas — and every idea needs a proof of concept (PoC) by Friday.

During the PoC phase, nobody questions the LLM response. As long as json.loads(response.content) works and the output looks right, we move fast. We're vibe coding: building quickly, demoing on Thursday, and the LLM is returning clean JSON on every test call. Ship it.

Then we move toward production.

Suddenly, the LLM starts returning markdown fences around the JSON, so we strip them. On the next call, the keys are slightly different, so we normalize them. Then a number comes back as a string, so we cast it. Before we realize it, we are 30 lines deep into a hand-rolled parser, and every new edge case adds another if statement.

Here’s the part that really bites: the database does not care that this system started as a 5-day PoC. Tables have schemas. Columns have types. The INSERT statement expects title VARCHAR, year INTEGER, and category VARCHAR — not whatever shape the LLM felt like returning today. That is exactly where probabilistic model output collides with deterministic application boundaries.

The gap between “the LLM gave me a JSON string” and “this data is ready for my schema” is where production bugs hide — the kind that happily pass every test we wrote during the PoC.

In this guide, I’ll walk through what structured output is, why it matters, and how to implement it properly with Pydantic and LangChain, using concrete code examples that show exactly what goes wrong at each stage when we don’t.

The Problem: LLMs Return Strings, Applications Need Structured Data

Let’s say we’re building a feature that recommends movies and stores them in a database.

At the prototype stage, the LLM call looks deceptively simple:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Recommend a sci-fi movie. Return JSON with: title, year, genre"
        }
    ],
)

text = response.choices[0].message.content
print(text)

We run this five times. Here’s what we get:

# Run 1 — perfect
'{"title": "Inception", "year": 2010, "genre": "sci-fi"}'

# Run 2 — wrapped in markdown fences
'```json\n{"title": "Arrival", "year": 2016, "genre": "sci-fi"}\n```'

# Run 3 — year is a string
'{"title": "Interstellar", "year": "2014", "genre": "Sci-Fi"}'

# Run 4 — different key names
'{"movie_title": "The Matrix", "release_year": 1999, "category": "Science Fiction"}'

# Run 5 — helpful prose before the JSON
'Here is my recommendation:\n{"title": "Blade Runner 2049", "year": 2017, "genre": "sci-fi"}'

Five calls, five different formats. Naturally, the first instinct is to do the obvious thing:

import json

def parse_movie(llm_text: str) -> dict:
    return json.loads(llm_text)

It works for Run 1, but crashes on Run 2 and Run 5 with json.JSONDecodeError.

So we patch the parser to strip markdown and extract the JSON block. Now the syntax problem is solved.
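A typical version of that patched parser looks something like this (an illustrative sketch; the exact regex and error handling are assumptions, not a recommendation):

```python
import json
import re

def parse_movie(llm_text: str) -> dict:
    """Hand-rolled parser: strips markdown fences and skips leading prose."""
    # Remove ```json ... ``` fences if present
    text = re.sub(r"```(?:json)?\s*|\s*```", "", llm_text)
    # If prose precedes the JSON, grab the first {...} block
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {llm_text!r}")
    return json.loads(match.group(0))
```

It handles Runs 1, 2, and 5, but notice what it does not handle: wrong key names, wrong types, and every format the next model version invents.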

But the real production problems are only beginning.

Run 3 gives us a string instead of an integer:

{"year": "2014"}

At first glance, downstream code still looks fine:

movie = parse_movie(llm_text)

db.execute(
    "INSERT INTO movies (title, year, genre) VALUES (%s, %s, %s)",
    (movie["title"], movie["year"], movie["genre"])
)

And that is exactly what makes this class of bug dangerous. Depending on the database, driver, or downstream system, the value may be coerced, rejected, or silently preserved as the wrong type. Even if the insert succeeds, we have already lost schema guarantees in application code.
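To make the "silently preserved as the wrong type" case concrete, here is a small sketch using SQLite, whose flexible type affinity coerces what it can and stores the rest unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER)")

# A numeric-looking string gets coerced; anything else is stored as-is.
conn.execute("INSERT INTO movies VALUES (?, ?)", ("Interstellar", "2014"))
conn.execute("INSERT INTO movies VALUES (?, ?)", ("Arrival", "around 2016"))

for year, kind in conn.execute("SELECT year, typeof(year) FROM movies"):
    print(year, kind)
# 2014 integer
# around 2016 text
```

No exception, no warning: the second row now carries a text value in an INTEGER column, and the bug surfaces much later, somewhere in a query or a report.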

Run 4 introduces a different failure mode.

{
  "movie_title": "The Matrix",
  "release_year": 1999,
  "category": "Science Fiction"
}

It parses successfully, but now the application code breaks:

title = movie["title"]   # KeyError: 'title'

The natural response is to add more defensive logic: key normalization, type casting, and casing fixes. Before long, we are 30 lines deep into brittle parsing code for just three fields — and it is still fragile.

The next model version might return a different key, wrap the payload in XML-style tags, or prepend reasoning text above the JSON.

The challenge is no longer parsing syntax; it is enforcing a reliable contract between probabilistic model output and deterministic application systems.

Structured output is how we restore that contract.

The 4 Levels of Structured Output

After seeing how quickly parser debt accumulates, the next question becomes: how much control can we actually exert over LLM output?

In practice there is a clear spectrum — from “please try to follow this format” all the way to “the model literally cannot produce invalid output.” Most production systems evolve through these four levels.

Figure 1. The guarantee spectrum: from best-effort prompting to schema-enforced output.

Level 1: Prompt-Based (Hope)

At the first level, we rely entirely on prompt instructions.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Recommend a movie. Return JSON with keys: title, year, genre"
    }]
)

The model tries its best, but everything we saw earlier can still happen: markdown fences, prose, wrong keys, or wrong types.

Guarantee: none.

Level 2: JSON Mode (Valid JSON, Wrong Shape)

We force valid JSON.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Recommend a movie. Return JSON with keys: title, year, genre"
    }],
    response_format={"type": "json_object"}
)

json.loads() now works reliably. The parsing problem is solved. But the schema problem remains — the model can still return wrong keys or wrong types.
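A tiny illustration of the remaining gap: the payload below is valid JSON, so json.loads() succeeds, yet the application contract is still broken.

```python
import json

# JSON mode guarantees syntax, not shape: this parses without error...
payload = json.loads('{"movie_title": "The Matrix", "release_year": 1999}')

# ...but the key our application expects simply is not there.
print(payload.get("title"))  # None
```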

Level 3: JSON Schema Mode (Right Shape)

We define the exact structure we want instead of merely asking for it.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Recommend a sci-fi movie from the 2000s"
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "movie_recommendation",
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "year": {"type": "integer"},
                    "genre": {"type": "string", "enum": ["action", "comedy", "sci-fi", "drama"]}
                },
                "required": ["title", "year", "genre"],
                "additionalProperties": False
            }
        }
    }
)

The model must return the correct keys, correct types, and valid enum values.

Level 4: Constrained Decoding (Schema-Enforced Output)

We don’t just validate after generation — we constrain generation itself.

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "movie_recommendation",
        "strict": True,
        "schema": { ... }  # same schema as above
    }
}

With strict=True, the model literally cannot generate invalid output (for example, it cannot return a string for a field defined as integer). This is currently the strongest guarantee available.

Structure is now enforced during decoding itself, and this naturally raises the next question: how do we get all these guarantees without writing raw JSON Schema by hand?

For three fields this is manageable. For 15 fields, nested objects, enums, and validation constraints, it quickly becomes tedious and error-prone.

This is exactly where Python developers want something more ergonomic.

Enter Pydantic: Schemas in Python

By this point we have seen how quickly raw strings and hand-rolled parsers become brittle. Once the response structure grows beyond a few flat fields, maintaining clean output handling turns into a maintenance nightmare.

Instead of writing raw JSON Schema by hand, we define the structure directly as a Python class:

from pydantic import BaseModel, Field
from enum import Enum

class Genre(str, Enum):
    ACTION = "action"
    COMEDY = "comedy"
    SCI_FI = "sci-fi"
    DRAMA = "drama"

class MovieRecommendation(BaseModel):
    title: str = Field(description="Full movie title, without year or parentheses")
    year: int = Field(ge=1888, le=2030, description="Release year as a 4-digit number")
    genre: Genre = Field(description="Primary genre - pick exactly one")

Figure 2. A single Pydantic model automatically generates schema and validates runtime data at the application boundary.

With a single Python class, we get three major benefits at once: automatic JSON Schema generation, built-in validation, and field-level guidance for the LLM.

Automatic JSON Schema Generation

The first benefit is that Pydantic automatically generates JSON Schema for us.

import json

print(json.dumps(MovieRecommendation.model_json_schema(), indent=2))

{
  "$defs": {
    "Genre": {
      "enum": ["action", "comedy", "sci-fi", "drama"],
      "title": "Genre",
      "type": "string"
    }
  },
  "properties": {
    "title": {
      "description": "Full movie title, without year or parentheses",
      "title": "Title",
      "type": "string"
    },
    "year": {
      "description": "Release year as a 4-digit number",
      "minimum": 1888,
      "maximum": 2030,
      "title": "Year",
      "type": "integer"
    },
    "genre": {
      "$ref": "#/$defs/Genre",
      "description": "Primary genre - pick exactly one"
    }
  },
  "required": ["title", "year", "genre"],
  "title": "MovieRecommendation",
  "type": "object"
}

There is no hand-written JSON Schema to maintain. We update the Python class, and the schema updates automatically. That is one of the biggest “aha” moments in structured output pipelines:

define the structure once in Python, and the schema can then be reused downstream.

Built-in Validation

The second advantage is validation. Pydantic validates and, when appropriate, coerces data as it enters the object.

movie = MovieRecommendation(
    title="Inception",
    year="2010",
    genre="sci-fi"
)

print(movie.year)        # 2010
print(type(movie.year))  # <class 'int'>

Here, "2010" came in as a string, but Pydantic converted it into an integer.

Now compare that with invalid values:

MovieRecommendation(
    title="Inception",
    year=2010,
    genre="banana"
)

# ValidationError:
# Input should be 'action', 'comedy', 'sci-fi' or 'drama'

MovieRecommendation(
    title="Inception",
    year=99999,
    genre="sci-fi"
)

# ValidationError:
# Input should be less than or equal to 2030

The failure now happens exactly where it should: at the application boundary, before invalid data can leak into downstream logic. That is precisely where schema violations should fail in a production system.

Schema as Model Guidance

The third benefit is often overlooked: the description= text is not just documentation for humans. It becomes part of the generated schema, which in turn helps guide structured generation.

For example:

year: int = Field(
    ge=1888,
    le=2030,
    description="Release year as a 4-digit number"
)

This acts as a field-level instruction. It is significantly more precise than simply naming the field year and hoping the model interprets it correctly. In practice, this is where schema design starts improving output quality, not just output validation.

So far, we've focused on structure and validation at the schema level, but that is only half of the story. In production systems, structure alone is not enough; we also need a reliable way to move validated data between the LLM and the application boundary. That brings us to serialization.

Serialization: The Bridge Between the LLM and the Application

There’s one concept that ties everything together: serialization.

An LLM returns JSON text. Our application needs a validated Python object. Serialization (and its reverse, deserialization) is the process that moves data across that boundary.

Without structure, this step becomes the fragile hand-rolled parser we saw earlier. With Pydantic, it becomes a single method call.

llm_response = '{"title": "Inception", "year": 2010, "genre": "sci-fi"}'

movie = MovieRecommendation.model_validate_json(llm_response)

print(movie.title)  # "Inception"
print(movie.year)   # 2010
print(movie.genre)  # Genre.SCI_FI

Figure 3. Pydantic replaces fragile parsing with a validated boundary between raw LLM JSON and application-safe objects.

We are no longer converting text into a loose dictionary. We are converting it directly into a validated domain object — typed, coerced, and application-safe.

When we need to send data back out, Pydantic gives us clean round-trip methods:

movie.model_dump()
# {'title': 'Inception', 'year': 2010, 'genre': 'sci-fi'}

movie.model_dump_json()
# '{"title":"Inception","year":2010,"genre":"sci-fi"}'

The fragile parser we started with has now been replaced by a clean, typed serialization boundary. That is one of the most important architectural shifts in production LLM systems.

Wiring It Together: Two Approaches

At the schema layer, we have everything we need. Our Pydantic model now gives us automatic JSON Schema generation, validation during deserialization, and a clean serialization boundary between the LLM and the application. The remaining question is straightforward: how do we connect this schema to the actual model call?

Today, there are two clean ways to do that. Both completely eliminate the hand-rolled parser from the beginning of this article. The right choice depends mostly on your stack and how much orchestration you need.

Option A: Native SDK (Recommended for single-provider stacks)

For teams using only OpenAI or Azure OpenAI, the native SDK is often the simplest and cleanest approach. Recent SDK versions allow a Pydantic model to be passed directly as the response format, which means schema conversion, validation, and deserialization are handled internally.

from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

class Genre(str, Enum):
    ACTION = "action"
    COMEDY = "comedy"
    SCI_FI = "sci-fi"
    DRAMA = "drama"

class MovieRecommendation(BaseModel):
    title: str = Field(description="Full movie title")
    year: int = Field(ge=1888, le=2030, description="Release year")
    genre: Genre = Field(description="Primary genre")

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Recommend a sci-fi movie from the 2000s"
    }],
    response_format=MovieRecommendation,
)

movie = response.choices[0].message.parsed
print(movie)

The important point here is that .parsed is already a fully validated Pydantic object. There is no json.loads(), no fence stripping, and no manual key normalization.

This pattern is no longer limited to OpenAI. Claude, Gemini, and Grok now all provide native structured output capabilities in their respective SDKs, including schema-based validation and typed responses.

If your system is built around a single provider, the native SDK is usually the most ergonomic option. However, if you expect provider switching, multi-step chains, retrieval pipelines, or agent workflows, LangChain’s .with_structured_output() becomes the stronger abstraction because it normalizes provider-specific differences behind a single interface.

Option B: LangChain .with_structured_output()

If your stack already uses LangChain — or you anticipate multi-step workflows, retrieval pipelines, or multi-provider support — LangChain provides a clean abstraction on top of the same Pydantic foundation.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0
)

structured_llm = llm.with_structured_output(MovieRecommendation)

movie = structured_llm.invoke(
    "Recommend a sci-fi movie from the 2000s"
)

print(movie)

The output is still the same validated domain object. The difference is architectural: LangChain makes it easier to compose structured output into larger systems such as retrieval pipelines, agents, or multi-step reasoning workflows.

The key shift remains the same. We are no longer receiving text and attempting to reverse-engineer structure after the fact. Instead, the model output enters the application as a typed object that already satisfies the schema contract.

A More Realistic Example: Support Ticket Extraction

Movie recommendations are useful for illustrating the mechanics, but production systems rarely stop at toy examples. The real value of structured output becomes much clearer in extraction workflows such as support email processing.

Consider a support email pipeline:

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    URGENT = "urgent"

class SupportTicket(BaseModel):
    subject: str
    priority: Priority
    product: str
    is_billing_issue: bool
    customer_sentiment: float = Field(ge=-1.0, le=1.0)
    action_items: list[str]

The extraction step is now reduced to a schema-bound model call:

ticket_llm = llm.with_structured_output(SupportTicket)

ticket = ticket_llm.invoke(
    f"Extract structured data from this support email:\n{email}"
)

The returned object can flow directly into downstream systems such as ticketing workflows, API responses, analytics pipelines, or database inserts. Instead of parsing a paragraph to infer urgency, billing status, sentiment, and next steps, the application receives a validated object with correct types and constraints already enforced. Structured output is no longer just a developer convenience; in production workflows, it becomes an operational requirement.
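As a sketch of that handoff (the ticket values below are invented for illustration), the validated object serializes cleanly in either direction:

```python
from enum import Enum
from pydantic import BaseModel, Field

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    URGENT = "urgent"

class SupportTicket(BaseModel):
    subject: str
    priority: Priority
    product: str
    is_billing_issue: bool
    customer_sentiment: float = Field(ge=-1.0, le=1.0)
    action_items: list[str]

# Stand-in for the object returned by the schema-bound model call:
ticket = SupportTicket(
    subject="Charged twice for Pro plan",
    priority="urgent",
    product="Pro plan",
    is_billing_issue=True,
    customer_sentiment=-0.8,
    action_items=["refund duplicate charge", "confirm with customer"],
)

# The validated object serializes cleanly for any downstream system:
row = ticket.model_dump(mode="json")  # JSON-compatible dict for a DB insert
body = ticket.model_dump_json()       # JSON string for an API response
```

Note that mode="json" turns the Priority enum into its plain string value, which is usually what a database driver or message queue expects.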

Production Hardening

Once structured output is part of a production system, observability becomes just as important as validation. One of the most useful patterns here is returning the raw response alongside the parsed object:

structured_llm = llm.with_structured_output(
    SupportTicket,
    include_raw=True
)

Validation failures should be treated as observable signals rather than opaque exceptions. When the same field fails repeatedly, it usually points to one of three issues: an unclear field description, overly strict constraints, or a schema shape that needs refinement.
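One way to surface those signals is to log each ValidationError entry before re-raising. The sketch below uses a trimmed-down SupportTicket and an assumed logger name:

```python
import logging
from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("structured_output")

class SupportTicket(BaseModel):
    subject: str
    customer_sentiment: float = Field(ge=-1.0, le=1.0)

def parse_ticket_or_log(raw_json: str) -> SupportTicket:
    try:
        return SupportTicket.model_validate_json(raw_json)
    except ValidationError as e:
        # Each entry names the failing field and the violated constraint,
        # exactly the signal needed to spot fields that fail repeatedly.
        for err in e.errors():
            logger.warning(
                "validation failure: field=%s type=%s msg=%s",
                ".".join(map(str, err["loc"])), err["type"], err["msg"],
            )
        raise
```

A dashboard over these log lines makes "sentiment keeps failing its bounds" visible long before anyone files a bug.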

For workflows that directly feed downstream systems, strict mode should generally be the default:

structured_llm = llm.with_structured_output(
    SupportTicket,
    method="json_schema",
    strict=True
)

With strict=True, schema enforcement moves into the generation process itself instead of relying only on post-generation validation. It adds roughly 5–15% latency but virtually eliminates downstream errors — a trade-off that is almost always worth it for production data.

For ultra-critical workflows, wrap the call in a simple retry that falls back to plain json_object mode after two failures. This gives you the best of both worlds: maximum reliability with a safety net.
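A minimal sketch of that retry-with-fallback wrapper; the callables are injected so the same logic works with any client, and how you bind the strict and fallback modes is an assumption about your stack:

```python
from pydantic import ValidationError

def invoke_with_fallback(strict_call, fallback_call, prompt, max_strict_tries=2):
    """Try the strict schema-enforced call first; after repeated
    failures, fall back to the more permissive JSON-mode call."""
    for _ in range(max_strict_tries):
        try:
            return strict_call(prompt)
        except (ValidationError, ValueError):
            continue  # retry the strict path
    return fallback_call(prompt)
```

With LangChain, strict_call could be llm.with_structured_output(SupportTicket, method="json_schema", strict=True).invoke and fallback_call the same model bound with method="json_mode".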

Design Principles That Actually Matter

After working through the full lifecycle, a few practical rules consistently matter in production:

  • every field should have a clear description
  • use enums whenever values are constrained
  • add numeric bounds wherever possible
  • keep schemas flat unless nesting is necessary
  • log every validation failure

These are not implementation details. They directly improve reliability and make downstream systems safer.

Closing Thoughts: From Text to Application Boundaries

The real challenge in production LLM systems is not generating text — it is turning probabilistic model output into reliable application data.

We started with json.loads(), a few assumptions, and fast-moving demo code. That approach works for a quick prototype, but it breaks the moment output feeds downstream systems.

Structured output changes that model entirely.

Once we define schemas first and validate at the boundary, model output stops behaving like unstructured text and starts behaving like a real interface contract.

That shift, from best-effort parsing to schema-first engineering, is what turns an LLM workflow into a production-grade system.


Structured Output for LLMs in Production: From json.loads() to Validated Objects was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
