I've been running structured output prompts through a bunch of open models on OpenRouter for the past few months (Llama 3, Mistral, Command R, DeepSeek, Qwen, and others) alongside the usual closed-source suspects. 288 calls total. I wanted to know what actually breaks, how often, and whether open models fail differently from the API-only ones.
Short answer: not really. The failure modes are almost identical across the board. The rate varies (some models hit you with markdown fences on nearly every call, others only when you phrase the prompt a certain way), but the categories of breakage are the same everywhere.
What I saw most, roughly in order:
- Markdown fences wrapping the JSON (the model thinks it's being helpful)
- Trailing commas (JS habits from training data)
- Python `True`/`False`/`None` instead of JSON `true`/`false`/`null`
- Truncated objects from running out of tokens mid-response
- Unescaped quotes inside string values
- `//` or `#` comments inside JSON
- Literal `...` where the model got lazy and didn't generate all the data
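To make a few of these concrete, here's the kind of payload that comes back (made-up example for illustration, not from my test data):

```python
import json

# A representative bad response hitting three failure modes at once:
# a markdown fence, a trailing comma, and a Python-style literal.
raw = '```json\n{"name": "widget", "tags": ["a", "b",], "active": True}\n```'

try:
    json.loads(raw)
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 1 (char 0)
```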
The reason I'm posting here specifically: most of the advice I see for handling this is "just use JSON mode" or "use a constrained grammar." And yeah, those help when they're available. But a lot of what people run locally doesn't have reliable JSON mode, grammar-based generation has its own tradeoffs (speed, compatibility), and even when you do get syntactically valid JSON you can still get schema violations and truncation.
I ended up building a Python library (outputguard) that validates against JSON Schema and runs 15 repair strategies in a specific order when things break. The ordering part turned out to be more important than I expected: fixing encoding before structure, and re-parsing between each strategy so later fixes don't undo earlier ones.
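To show what I mean by ordering and re-parsing, here's a stripped-down sketch of the loop. The strategy functions are illustrative stand-ins, not outputguard's actual API:

```python
import json
import re

def strip_markdown_fences(text):
    # Drop a leading ``` or ```json and a trailing ``` if present
    return re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())

def fix_python_literals(text):
    # Naive True/False/None -> true/false/null; real repair code also
    # has to avoid touching occurrences inside string values
    for py, js in (("True", "true"), ("False", "false"), ("None", "null")):
        text = re.sub(rf"\b{py}\b", js, text)
    return text

def remove_trailing_commas(text):
    # ", }" or ", ]" -> "}" / "]"
    return re.sub(r",\s*([}\]])", r"\1", text)

# Order matters: surface-level cleanup (fences) before structural fixes
STRATEGIES = [strip_markdown_fences, fix_python_literals, remove_trailing_commas]

def repair(text):
    for strategy in STRATEGIES:
        try:
            return json.loads(text)   # re-parse before each strategy so a
        except json.JSONDecodeError:  # later fix can't undo a clean parse
            text = strategy(text)
    return json.loads(text)           # final attempt; raises if still broken
```

Run on the broken payload above, that loop strips the fence, converts `True`, drops the trailing comma, and parses on the final attempt.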
Also handles YAML, TOML, and Python literals, which came up more than I thought it would once I started working with models that don't have a JSON mode and just output whatever format they feel like.
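For the multi-format case, the general pattern is a fallback chain ordered from strictest parser to most permissive. A rough sketch (assumes Python 3.11+ for tomllib and PyYAML installed; again, not the library's actual code):

```python
import ast
import json
import tomllib  # stdlib on Python 3.11+

import yaml  # third-party: pip install pyyaml

def parse_any(text):
    # JSON first: strictest parser and the most common case
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Python literals: catches {'key': True, 'n': None} style output
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        pass
    # TOML
    try:
        return tomllib.loads(text)
    except tomllib.TOMLDecodeError:
        pass
    # YAML last: it accepts nearly any input, so it's the catch-all
    return yaml.safe_load(text)
```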
Wrote up the full findings in a blog post if anyone wants the details: What Breaks When You Ask an LLM for JSON
2,001 tests, MIT licensed, no LLM provider dependencies. `pip install outputguard`
Curious what other people's experience has been — are you seeing the same failure patterns, or are there models/quants that behave differently than what I'm describing?