Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%)

I tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. All models ran locally on Apple Silicon.

Overall ranking (9 evaluable suites):

  • Gemma 4 E4B — 83.6%
  • Gemma 3 12B — 82.3%
  • Gemma 3 4B — 80.8%
  • Gemma 4 E2B — 80.4% ← new entry
  • Gemma 2 2B — 77.6%
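Assuming the overall score is an unweighted macro-average over the 9 evaluable suites (the post doesn't state the aggregation method), the computation is just:

```python
def macro_average(suite_scores):
    """Unweighted mean of per-suite scores, in percent."""
    return sum(suite_scores) / len(suite_scores)

# Hypothetical per-suite scores for one model (the first five mirror the
# E2B numbers reported below; the rest are illustrative fillers):
scores = [70.0, 92.9, 80.2, 83.3, 93.3, 50.0, 80.0, 85.0, 88.9]
overall = macro_average(scores)
```

A macro-average weights every suite equally regardless of how many test cases it contains; a weighted average would rank models slightly differently if suite sizes vary a lot.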

Key E2B results:

  • Multi-turn: 70% (highest in family — beats every larger sibling)
  • Classification: 92.9% (tied with 4B and 12B)
  • Info Extraction F1: 80.2% (matches 12B)
  • Multilingual: 83.3%
  • Safety: 93.3% (100% prompt injection resistance)
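For reference, extraction F1 is the harmonic mean of precision and recall over extracted fields. A minimal sketch (the exact field-matching criteria the harness uses are not stated in the post):

```python
def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

Because F1 is a harmonic mean, it punishes an imbalance between precision and recall, so a model can't inflate its score by over- or under-extracting fields.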

Same parameter count, generational improvement (Gemma 2 2B → Gemma 4 E2B):

  • Multi-turn: 40% → 70% (+30)
  • RAG grounding: 33.3% → 50% (+17)
  • Function calling: 70% → 80% (+10)

7 of 8 suites improved at the same parameter count.

Function calling initially crashed our evaluator with TypeError: unhashable type: 'dict', because the model returned nested dicts where the schema expected plain strings. This is the third small-model evaluator bug I've found this year.
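A defensive fix on the evaluator side is to serialize nested arguments before using them as dict keys or set members. A minimal sketch, not our actual harness code (the helper name is hypothetical):

```python
import json

def hashable_arg(value):
    """Convert a possibly-nested dict/list argument into a hashable key.

    Small models sometimes emit nested dicts where the schema expects
    plain strings; serializing to canonical JSON avoids
    TypeError: unhashable type: 'dict' when deduplicating calls.
    """
    if isinstance(value, (dict, list)):
        return json.dumps(value, sort_keys=True)
    return value

# Deduplicating function calls no longer crashes on nested arguments:
calls = [{"name": "lookup", "args": {"q": {"city": "Paris"}}},
         {"name": "lookup", "args": {"q": {"city": "Paris"}}}]
unique = {(c["name"], hashable_arg(c["args"])) for c in calls}  # len == 1
```

Canonical JSON (sort_keys=True) makes two structurally equal dicts serialize identically, so deduplication still works as intended.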

submitted by /u/Zealousideal-Yard328