About the Model:
35B total parameters, 3B active (A3B), in a mixture-of-experts (MoE) architecture.
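As a rough illustration of the total-vs-active split in an MoE model (the expert counts and sizes below are generic toy assumptions, not this model's real layout):

```python
# Toy illustration of MoE parameter accounting (generic assumptions,
# not this model's actual expert layout).
def moe_param_counts(n_experts, params_per_expert, shared_params, top_k):
    """Total params store every expert; active params per token use only
    the routed top-k experts plus the shared (attention/embedding) weights."""
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Example: 8 experts of 1B each, 1B shared weights, top-2 routing.
total, active = moe_param_counts(8, 1e9, 1e9, 2)  # 9B total, 3B active
```

The point is that inference cost tracks the active count, which is why a large-total MoE can be practical on CPU.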
Evaluation approach taken:
We took the Q4_K_M quantized GGUF from Unsloth, ran it on CPU via llama-cpp-python, and tested it on three standard benchmarks:
- HumanEval (code generation),
- HellaSwag (commonsense reasoning), and
- BFCL (function calling).
1,264 samples total.
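The setup above can be sketched with llama-cpp-python; the model filename, context size, and the helper function are illustrative assumptions, not details from the actual run:

```python
# Minimal sketch of the CPU-only setup described above (assumed details:
# filename, context size, and the helper itself are hypothetical).
import os

def build_cpu_llm(model_path, n_threads=32):
    """Load a GGUF model for CPU-only inference, or return None if the
    library or the model file is unavailable."""
    try:
        from llama_cpp import Llama
    except ImportError:
        return None
    if not os.path.exists(model_path):
        return None
    return Llama(
        model_path=model_path,  # e.g. a Q4_K_M GGUF from Unsloth
        n_ctx=4096,             # context window (assumed)
        n_threads=n_threads,    # matches the 32 vCPUs reported below
        n_gpu_layers=0,         # keep everything on CPU
    )

llm = build_cpu_llm("Q4_K_M.gguf")  # hypothetical filename
```

Setting `n_gpu_layers=0` ensures no layers are offloaded, which matches the no-GPU hardware described below.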
Evaluation Results:
- HumanEval: 47.56% (78/164)
- HellaSwag: 74.30% (743/1000)
- BFCL: 46.00% (46/100)
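The reported percentages follow directly from the raw correct/total counts above; a quick recomputation:

```python
# Recompute the reported scores from the raw (correct, total) counts above.
results = {
    "HumanEval": (78, 164),
    "HellaSwag": (743, 1000),
    "BFCL":      (46, 100),
}

def accuracy_pct(correct, total):
    """Percentage accuracy, rounded to two decimals as in the report."""
    return round(100 * correct / total, 2)

scores = {name: accuracy_pct(c, t) for name, (c, t) in results.items()}
total_samples = sum(t for _, t in results.values())  # 164 + 1000 + 100 = 1264
```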
Hardware:
32 vCPU, 125GB RAM. No GPU.
What This Means:
The Q4_K_M quantized variant runs at 22 tokens/sec on CPU, delivering decent speed, and performs best on commonsense reasoning at 74%. Code generation and function calling are harder tasks for this variant, landing in the mid-40s.
Overall, these are solid results for a 3B-active MoE model running quantized on CPU.
This entire evaluation was performed using Neo AI Engineer, which researched which quantized variants could run on the available CPU system, applied the correct chat template, built a consolidated eval harness for the three benchmarks, and reported the final results after thorough review.