Gemma 4 E4B vs Qwen3.5-4B on document tasks: Qwen wins the benchmarks, but the sub-scores tell a different story

Results live here: https://www.idp-leaderboard.org/

Ran both through the IDP Leaderboard (OlmOCR Bench, OmniDocBench, IDP Core) and the headline numbers aren't the interesting part.

Top-line scores:

| Benchmark | Gemma 4 E4B | Qwen3.5-4B |
| --- | --- | --- |
| OlmOCR | 47.0 | 75.4 |
| OmniDoc | 59.7 | 67.6 |
| IDP Core | 55.0 | 74.5 |

Qwen wins all three. On OlmOCR the gap is 28 points. Open and shut, right?

Not quite. Drill into IDP Core:

| Sub-task | Gemma 4 E4B | Qwen3.5-4B |
| --- | --- | --- |
| OCR (raw text recognition) | 74.0 | 64.7 |
| KIE (structured extraction) | 11.1 | 86.0 |
| Table | 55.0 | 76.7 |
| VQA | 65.3 | 72.4 |

Gemma reads text from documents better than Qwen; it just can't do anything structured with what it reads. The KIE collapse (11.1 vs 86.0) doesn't look like a vision failure so much as an instruction-following failure on schema-defined outputs (at least, that's my guess).
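If the failure really is schema adherence rather than perception, one cheap test is to pin the output format with an explicit JSON schema in the prompt and validate the reply. A minimal sketch of that idea; the prompt wording, the toy invoice fields, and the validation logic are all my own illustration, not anything from the benchmark:

```python
import json


def build_kie_prompt(document_text, schema):
    """Build a schema-constrained extraction prompt (hypothetical format)."""
    return (
        "Extract the following fields from the document and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Document:\n{document_text}\n"
        "JSON:"
    )


def validate_kie_output(raw, schema):
    """Parse model output as JSON and check every schema key is present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not all(key in data for key in schema):
        return None
    return data


# Toy invoice schema; field names are made up for illustration.
schema = {"invoice_number": "string", "total": "number"}
prompt = build_kie_prompt("Invoice #123, total $42.50", schema)

# Pretend this string came back from the model.
parsed = validate_kie_output('{"invoice_number": "123", "total": 42.5}', schema)
```

If Gemma's KIE score recovers under this kind of validated, retry-on-failure prompting, the gap is prompt-fixable; if not, it's more fundamental.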

Same pattern in OlmOCR: Gemma scores 48.4 on H&F (handwriting/figures) vs Qwen's 47.2, essentially tied on the hardest visual subset. But Multi-Col is 37.1 vs 79.2. Multi-column layout needs compositional spatial reasoning, not just pixel-level reading.

Within the Gemma family, the E2B (2.3B effective) to E4B (4.5B effective) gap is steep: OlmOCR goes 38.2 → 47.0, OmniDoc 43.3 → 59.7. Worth knowing if you're considering the smaller variant.

Practical takeaways:

If you're running end-to-end extraction pipelines, Qwen3.5-4B is still the better pick at this size. But if you're preprocessing documents before passing to another model and you care about raw text fidelity over structured output, Gemma's perception quality is underrated.
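That split suggests a two-stage routing scheme: send raw transcription to the perception-strong model and structured extraction to the instruction-strong one. A minimal sketch with stub callables standing in for the real models; nothing here is an actual Gemma or Qwen API:

```python
def transcribe(page, ocr_model):
    """Stage 1: raw text fidelity -- route to the perception-strong model."""
    return ocr_model(page)


def structure(text, extractor_model, fields):
    """Stage 2: schema-following -- route to the instruction-strong model."""
    return extractor_model(text, fields)


def pipeline(page, ocr_model, extractor_model, fields):
    """Compose the two stages: OCR first, then structured extraction."""
    return structure(transcribe(page, ocr_model), extractor_model, fields)


# Stubs standing in for Gemma (transcription) and Qwen (extraction).
fake_gemma = lambda page: "Invoice #123 total $42.50"
fake_qwen = lambda text, fields: {f: text for f in fields}

result = pipeline(b"<page image bytes>", fake_gemma, fake_qwen, ["invoice_number"])
```

The design choice is just separation of concerns: each model only sees the task its sub-scores say it's good at.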

Gemma might actually be better at handwriting recognition, since that's what the OCR tasks resemble (check this benchmark OCR sample, for example: https://www.idp-leaderboard.org/explore/?model=Nanonets+OCR2%2B&benchmark=idp&task=OCR&sample=ocr_handwriting_3).

And lastly, Gemma felt like a reasoning powerhouse to me, coming close to Qwen on the VQA sub-task.

The other Gemma angle: E2B and E4B have native audio input baked into the model weights. No separate pipeline. For anyone building voice + document workflows at the edge, nothing else at this size does that.

One genuine problem right now: the 26B MoE variant is running ~11 tok/s vs Qwen 35B-A3B at 60+ tok/s on a 5060 Ti 16GB. Same hardware. The routing overhead is real. Dense 31B is more predictable (~18–25 tok/s on dual consumer GPUs), but the MoE speed gap is hard to ignore.
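For anyone reproducing the throughput comparison, wall-clock tokens/second is easy to measure around any generate call. A sketch; the `generate` callable here is a stand-in, not a real inference API:

```python
import time


def measure_tps(generate, prompt, n_tokens):
    """Time one generation call and return tokens per second (wall clock)."""
    start = time.perf_counter()
    out_tokens = generate(prompt, n_tokens)
    elapsed = time.perf_counter() - start
    return len(out_tokens) / elapsed


# Stub generator that emits n tokens, just to show the call shape.
fake_generate = lambda prompt, n: ["tok"] * n

tps = measure_tps(fake_generate, "hello", 128)
```

For MoE models, it's worth averaging over several long generations, since routing overhead can vary with the token mix.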

Anyone running these on real document workloads? Curious whether the KIE gap closes with structured prompting or if it's more fundamental.

submitted by /u/shhdwi
