baidu/ERNIE-Image · Hugging Face
submitted by /u/adefa
Hey! Nathan from Hugging Face here. I maintained the Open LLM Leaderboard, and in that time I've evaluated around 10k models. I think there's a pretty big misconception in how people benchmark LLMs. Most setups I see rely on inference providers like Ope…
If you don't install the SPIR-V headers, it will no longer compile; keep that in mind: https://github.com/ggml-org/llama.cpp/pull/21572/changes#diff-43453f510556d352276e897e137cb103b3bbca24acb6cba33208d4887b2e3c77R497 submitted by /u…
Here's what I tested:
Prompt: Provide a brief summary of the events in 1989, comparing the results in Europe versus Asia.
Response: (a solid overview covering the major events) […] Fall of the Berlin Wall (Nov 9): The defining moment when East Ge…
https://github.com/ggml-org/llama.cpp/pull/21038 Now that cache quantization has better quality, does that mean a Q8 cache is a good choice? For example, for Gemma 4 26B? submitted by /u/Longjumping_Bee_6825
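To see why a Q8 KV cache is attractive, here is a rough size calculation. The model dimensions below (layers, KV heads, head size, context length) are illustrative assumptions, not Gemma's published configuration; Q8_0 stores 32 int8 values plus a 2-byte scale per block, about 8.5 bits per value versus 16 for F16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    # K and V each hold n_kv_heads * head_dim values per layer per token,
    # hence the factor of 2.
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8

# Hypothetical Gemma-like dimensions (assumptions for illustration only)
f16 = kv_cache_bytes(48, 8, 256, 32768, 16)
q8  = kv_cache_bytes(48, 8, 256, 32768, 8.5)  # Q8_0 ~ 8.5 bits/value incl. scales
print(f"F16: {f16 / 2**30:.1f} GiB, Q8_0: {q8 / 2**30:.1f} GiB")
# -> F16: 12.0 GiB, Q8_0: 6.4 GiB
```

So at these assumed dimensions a Q8 cache roughly halves KV memory; whether the quality is now good enough is exactly what the linked PR is about.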
This is V2 of my previous post. What's new: --ai-tune, where the model starts tuning its own flags in a loop and caches the fastest config it finds. My weird rig: 3090 Ti + 4070 + 3060 + 128GB RAM. Model | llama-server | llm-server v1 tuning | llm-server v2…
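The tune-and-cache loop described above can be sketched as follows. This is a minimal illustration under assumptions of mine: the flag names (`n_gpu_layers`, `batch_size`) are hypothetical, and `benchmark` is a stand-in that a real version would replace with a timed decode against the running server.

```python
import itertools
import random

def benchmark(config):
    """Stand-in for a real tokens/sec measurement against a running server."""
    # A real version would run a fixed prompt and time token generation;
    # here we return a deterministic-per-config fake throughput.
    random.seed(str(sorted(config.items())))
    return random.uniform(10, 60)  # fake tokens/sec

# Hypothetical flag space; real flag names depend on the server build
search_space = {
    "n_gpu_layers": [20, 40, 60],
    "batch_size": [256, 512],
}

best_cfg, best_tps = None, 0.0
for combo in itertools.product(*search_space.values()):
    cfg = dict(zip(search_space.keys(), combo))
    tps = benchmark(cfg)
    if tps > best_tps:
        best_cfg, best_tps = cfg, tps

# The fastest config would then be cached (e.g. written to a JSON file)
# so later runs can skip the search.
print("fastest config:", best_cfg, f"{best_tps:.1f} tok/s")
```

Exhaustive search is fine for a handful of flags; with more dimensions, a hill-climbing or random-search variant keeps the loop cheap.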
I created a small demo to illustrate how agents work compared to a standard chatbot. Afterwards, I played with the simple loop and added 5 tools: grep, glob, read_file, write_file, edit_file, and gave it a code editing task to see how it fared w…
Gemma quant comparison on an M5 Max MacBook Pro with 128GB (subjective of course, but across a variety of categories): gemma 4 leaderboard. The surprising bit: Gemma 4 31B 4-bit scored higher than 8-bit, 91.3% vs 88.4%. Not sure why: could be the template, could …
If the claims presented in the paper are true, this will be very big for multi-user local inference. submitted by /u/Particular-Look-2640