Multi-Token Prediction (MTP) for LLaMA.cpp – Gemma 4 speedup by 40%

Implemented Multi-Token Prediction for LLaMA.cpp.
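At a high level, MTP speeds up decoding by having the model propose several draft tokens per step and then keeping the prefix that verification accepts. A minimal sketch of that draft-and-verify loop, with illustrative function names that are assumptions, not the actual llama.cpp API:

```python
# Toy sketch of a multi-token-prediction decode loop (names are
# illustrative; this is NOT the llama.cpp implementation).

def generate_with_mtp(draft_step, verify, prompt, max_tokens=32):
    """draft_step(tokens) -> list of proposed next tokens;
    verify(tokens, proposal) -> count of accepted tokens (>= 1)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_tokens:
        proposal = draft_step(tokens)        # cheap multi-token draft
        accepted = verify(tokens, proposal)  # full model checks the draft
        tokens.extend(proposal[:accepted])   # keep the agreed-on prefix
    return tokens

# Toy demo: draft four consecutive integers, accept them all.
draft = lambda ts: [ts[-1] + i for i in range(1, 5)]
accept_all = lambda ts, prop: len(prop)
print(generate_with_mtp(draft, accept_all, [0], max_tokens=8))
```

Because several draft tokens can be accepted per verification pass, the loop emits more than one token per full-model step, which is where the throughput gain comes from.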

Quantized Gemma 4 assistant models into GGUF format.

Ran tests on a MacBook Pro M5 Max. With MTP, Gemma 26B generates tokens about 40% faster.

Prompt: Write a Python program to find the nth Fibonacci number using recursion
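For reference, the benchmark prompt asks for code along these lines (one straightforward answer, not necessarily the model's actual output):

```python
def fib(n):
    """Return the nth Fibonacci number (fib(0)=0, fib(1)=1) via naive recursion."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(10))  # → 55
```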

Throughput:
LLaMA.cpp: 97 tokens/s
LLaMA.cpp + MTP: 138 tokens/s
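The two throughput figures work out to roughly the claimed 40% speedup:

```python
baseline = 97   # tokens/s, stock LLaMA.cpp
mtp = 138       # tokens/s, LLaMA.cpp + MTP
speedup = (mtp - baseline) / baseline
print(f"{speedup:.1%}")  # → 42.3%
```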

Gemma4-assistant GGUF Quantized models: https://huggingface.co/collections/AtomicChat/gemma-4-assistant-gguf

Local AI models app: http://atomic.chat

Patched llama.cpp: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant

submitted by /u/gladkos
