LocalLLaMA

Qwen 3.5 397B vs Qwen 3.6-Plus

I see a lot of people worried that Qwen 3.6 397B might not be released. However, looking at the small variation between 3.5 and 3.6 across many benchmarks, I think that simply quantizing 3.6 to "human" dimen…

LocalLLaMA

Quantizer appreciation post

Hey everyone, yesterday I decided to try to learn how to quantize GGUFs myself with reasonable quality, in order to understand the magic behind the curtain. Holy… I did not expect how much work it is, how long it takes, and that it requires A LOT (500GB!) o…
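For anyone curious what the "magic" boils down to: GGUF quant formats like Q4_0 store each block of 32 weights as 4-bit integers plus one shared scale. Here's a toy Python sketch of that idea to show where the quality loss comes from; it is an illustration of the concept, not llama.cpp's actual code, and the block size and weight distribution are assumptions.

```python
# Toy block-wise symmetric 4-bit quantization, in the spirit of GGUF's Q4_0:
# each block of 32 weights shares one scale; weights become integers in -8..7.
import random

BLOCK = 32  # Q4_0-style block size (assumption for this sketch)

def quantize_block(ws):
    # Symmetric scheme: map the largest-magnitude weight onto +/-7.
    scale = max(abs(w) for w in ws) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in ws]
    return scale, q

def dequantize_block(scale, q):
    return [scale * v for v in q]

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(BLOCK)]  # fake layer weights
scale, q = quantize_block(weights)
recon = dequantize_block(scale, q)

# Reconstruction error: the per-block scale bounds it to roughly scale/2 per weight.
mse = sum((w - r) ** 2 for w, r in zip(weights, recon)) / BLOCK
```

Real quantizers (and especially imatrix-guided ones) layer a lot of calibration work on top of this, which is where the hundreds of gigabytes and hours go.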

LocalLLaMA

Gemma 4 MoE hitting 120 TPS on Dual 3090s!

Thought I'd share some benchmark numbers from my local setup. Hardware: Dual NVIDIA RTX 3090s Model: Gemma 4 (MoE architecture) Performance: ~120 Tokens Per Second The efficiency of this MoE implementation is unreal. Even with a heavy load, the thr…
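The back-of-the-envelope reason MoE decodes so fast: per token you only read the routed experts' weights, not the whole model, and decoding is roughly memory-bandwidth bound. A hedged sketch, where every model number below is hypothetical (not Gemma 4's actual configuration) and the bandwidth figure assumes ideal scaling across both cards:

```python
# Why MoE throughput beats a dense model of the same size on disk:
# only shared weights + top-k experts are touched per decoded token.

def active_params(n_experts, top_k, expert_params, shared_params):
    """Parameters read per token with top-k expert routing (toy model)."""
    return shared_params + top_k * expert_params

# Hypothetical MoE: 64 experts of 1.5B params each, 4B shared (attention etc.)
total = 64 * 1.5e9 + 4e9                      # ~100B params stored
active = active_params(64, 2, 1.5e9, 4e9)     # 7B params read per token

# Upper bound on decode speed: TPS <= bandwidth / bytes moved per token.
bandwidth = 2 * 936e9     # two RTX 3090s at ~936 GB/s each, ideal scaling
bytes_per_param = 0.5     # ~4-bit quantized weights
tps_bound = bandwidth / (active * bytes_per_param)
```

The bound comes out far above 120 TPS; real runs land well below it because of inter-GPU transfers, activations, KV-cache reads, and imperfect overlap, but it shows why a ~100B MoE can decode like a ~7B dense model.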
