LocalLLaMA

Gemma 4 fixes in llama.cpp

There have already been claims that Gemma is bad because it doesn't work well, but chances are you aren't using the transformers implementation; you're using llama.cpp. After a model is released, you usually have to wait at least a few days for all the fixes in…

LocalLLaMA

Gemma 4 – 4B vs Qwen 3.5 – 9B ?

Hello! Has anyone tried the 4B Gemma 4 model and the Qwen 3.5 9B model and can share their feedback? On benchmarks Qwen seems to be doing better, but I would appreciate any personal experience on the matter. Thanks! submitted by /u/No-…

LocalLLaMA

Qwen 3.5 397B vs Qwen 3.6-Plus

I see a lot of people worried that Qwen 3.6 397B might not be released. However, looking at the small variation between 3.5 and 3.6 across many benchmarks, I think that simply quantizing 3.6 to "human" dimen…

LocalLLaMA

Quantizers appreciation post

Hey everyone, yesterday I decided to try and learn how to quantize GGUFs myself with reasonable quality, in order to understand the magic behind the curtain. Holy… I did not expect how much work it is, how long it takes, and that it requires A LOT (500GB!) o…
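The disk figure above is plausible: a rough back-of-the-envelope estimate shows how quickly quantization working space adds up. A minimal sketch, assuming approximate bits-per-weight values for common GGUF quant types (the exact averages vary with the tensor mix and llama.cpp version):

```python
# Rough GGUF size estimator. The bits-per-weight numbers are approximations,
# not exact values from any specific llama.cpp release.
APPROX_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,      # approximate, includes per-block scale overhead
    "Q4_K_M": 4.85,   # approximate average across tensors
}

def gguf_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a given parameter count and quant type."""
    total_bits = n_params_billion * 1e9 * APPROX_BPW[quant]
    return total_bits / 8 / 1e9

# Quantizing typically keeps the full-precision intermediate on disk alongside
# each quantized output, so working space balloons well past the final size.
print(gguf_size_gb(70, "F16"))     # ~140 GB just for the F16 intermediate
print(gguf_size_gb(70, "Q4_K_M"))  # ~42 GB per Q4_K_M output
```

Add the original safetensors download on top of the F16 conversion and a handful of quant variants, and a 70B-class model easily chews through hundreds of gigabytes of scratch space.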

LocalLLaMA

Gemma 4 MoE hitting 120 TPS on Dual 3090s!

Thought I'd share some benchmark numbers from my local setup. Hardware: dual NVIDIA RTX 3090s. Model: Gemma 4 (MoE architecture). Performance: ~120 tokens per second. The efficiency of this MoE implementation is unreal. Even with a heavy load, the thr…
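For reference, throughput figures like the one above are usually just generated tokens divided by wall-clock time. A minimal sketch of that measurement; the `model.generate(prompt)` call is a hypothetical placeholder for whatever local backend you run:

```python
import time

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tokens per second for a single generation run."""
    return n_tokens / elapsed_s

start = time.perf_counter()
# tokens = model.generate(prompt)   # backend-specific call, omitted here
elapsed = time.perf_counter() - start

# Example: 960 tokens generated in 8 seconds matches the ~120 TPS reported above.
print(tokens_per_second(960, 8.0))  # 120.0
```

Note that prompt-processing (prefill) speed and generation speed differ substantially, so it's worth reporting which one a TPS number refers to.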
