Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%)

I tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. All models ran locally on Apple Silicon.

Overall ranking (9 evaluable suites):

  • Gemma 4 E4B — 83.6%
  • Gemma 3 12B — 82.3%
  • Gemma 3 4B — 80.8%
  • Gemma 4 E2B — 80.4% ← new entry
  • Gemma 2 2B — 77.6%
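Assuming the overall score is an unweighted macro-average over the 9 evaluable suites (the post doesn't state the aggregation method), the computation is just:

```python
def macro_average(suite_scores):
    """Unweighted mean of per-suite scores, in percent."""
    return sum(suite_scores) / len(suite_scores)

# Hypothetical per-suite scores for one model (the first five mirror the
# E2B numbers reported below; the rest are illustrative fillers):
scores = [70.0, 92.9, 80.2, 83.3, 93.3, 50.0, 80.0, 85.0, 88.9]
overall = macro_average(scores)
```

A macro-average weights every suite equally regardless of how many test cases it contains; a weighted average would rank models slightly differently if suite sizes vary a lot.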

Key E2B results:

  • Multi-turn: 70% (highest in family — beats every larger sibling)
  • Classification: 92.9% (tied with 4B and 12B)
  • Info Extraction F1: 80.2% (matches 12B)
  • Multilingual: 83.3%
  • Safety: 93.3% (100% prompt injection resistance)
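For reference, extraction F1 is the harmonic mean of precision and recall over extracted fields. A minimal sketch (the exact field-matching criteria the harness uses are not stated in the post):

```python
def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

Because F1 is a harmonic mean, it punishes an imbalance between precision and recall, so a model can't inflate its score by over- or under-extracting fields.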

Same parameter count, generational improvement (Gemma 2 2B → Gemma 4 E2B):

  • Multi-turn: 40% → 70% (+30)
  • RAG grounding: 33.3% → 50% (+17)
  • Function calling: 70% → 80% (+10)

7 of 8 suites improved at the same parameter count.

Function calling initially crashed our evaluator with TypeError: unhashable type: 'dict', because the model returned nested dicts where the schema expected plain strings. This is the third small-model evaluator bug I've found this year.
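A defensive fix on the evaluator side is to serialize nested arguments before using them as dict keys or set members. A minimal sketch, not our actual harness code (the helper name is hypothetical):

```python
import json

def hashable_arg(value):
    """Convert a possibly-nested dict/list argument into a hashable key.

    Small models sometimes emit nested dicts where the schema expects
    plain strings; serializing to canonical JSON avoids
    TypeError: unhashable type: 'dict' when deduplicating calls.
    """
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value

# Deduplicating function calls no longer crashes on nested arguments:
calls = [{"name": "lookup", "args": {"q": {"city": "Paris"}}},
         {"name": "lookup", "args": {"q": {"city": "Paris"}}}]
unique = {(c["name"], hashable_arg(c["args"])) for c in calls}  # len == 1
```

Canonical JSON (sort_keys=True) makes two structurally equal dicts serialize identically, so deduplication still works as intended.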

submitted by /u/Zealousideal-Yard328