Turboderp has been on an absolute tear recently in the endless battle to cram new llamas into smaller, faster boxes.
We started off last month with the release of Gemma 4 support and continued with improved caching efficiency.
DFlash support came 2 weeks ago with these impressive results:
| Category | Baseline | N-gram/suffix | DFlash |
|---|---|---|---|
| Agentic, code | 55.98 t/s | 89.58 t/s (1.60x) | 140.61 t/s (2.51x) |
| Agentic, curl | 54.03 t/s | 74.62 t/s (1.38x) | 125.94 t/s (2.33x) |
| Coding | 59.21 t/s | 75.34 t/s (1.27x) | 177.67 t/s (3.00x) |
| Creative | 59.10 t/s | 67.26 t/s (1.13x) | 89.19 t/s (1.50x) |
| Creative (reasoning) | 59.03 t/s | 64.25 t/s (1.09x) | 93.54 t/s (1.58x) |
| Translation | 58.11 t/s | 55.39 t/s (0.95x) | 75.73 t/s (1.30x) |
| Translation (reasoning) | 58.08 t/s | 80.21 t/s (1.38x) | 119.43 t/s (2.06x) |
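The multipliers in parentheses are just each method's throughput divided by the baseline throughput. A minimal sketch of that arithmetic, using the Agentic/code row from the table above:

```python
def speedup(method_tps: float, baseline_tps: float) -> float:
    """Relative speedup: decoding throughput (tokens/s) divided by baseline throughput."""
    return method_tps / baseline_tps

# Agentic, code row: DFlash 140.61 t/s vs. baseline 55.98 t/s
print(round(speedup(140.61, 55.98), 2))  # 2.51 — matches the table
```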
More model optimizations landed last week, with these improvements:
| Model | 3090¹ | 4090¹ | 5090¹ | 6000 Pro¹ | 5090² | 6000 Pro² |
|---|---|---|---|---|---|---|
| Qwen3.5-35B-A3B 4.00bpw | 5.3% | 5.8% | 8.6% | 10.3% | 21.0% | 23.5% |
| Qwen3.5-27B 4.00bpw | 0.0% | 1.9% | 8.1% | 11.7% | 13.1% | 15.0% |
| Trinity-Nano 4.15bpw | 29.5% | 48.6% | 52.3% | 52.9% | 70.5% | 72.4% |
| Gemma4-26B-A4B 4.10bpw | 3.1% | 2.9% | 7.8% | 9.6% | 16.4% | 19.2% |
| Gemma4-31B 4.00bpw | 4.0% | 4.9% | 10.0% | 8.0% | 16.0% | 12.0% |
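The percentages above presumably express relative gains over the previous release on each GPU, i.e. `(new / old - 1) × 100`. A quick sketch of that calculation; the input numbers here are hypothetical, purely for illustration, and not taken from the table:

```python
def pct_improvement(new_tps: float, old_tps: float) -> float:
    """Percentage gain of the optimized build over the previous one."""
    return (new_tps / old_tps - 1.0) * 100.0

# Hypothetical before/after throughputs (illustrative only):
print(round(pct_improvement(70.5, 54.0), 1))  # 30.6
```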
The last two days brought DFlash model quantization plus more bugfixes and efficiency work, and there's already more happening on the dev branch!
Come say hi on the exllama Discord.