LocalLLaMA

Gemma4 issue with winogrande bench

gemma-4-26B-A4B-it-Q4_K_M only reaches about 50% accuracy on winogrande-debiased-eval.csv when run with llama-perplexity, while qwen3.5-35B-A3B-IQ4_NL reaches 75%+ on the same benchmark. Yet in real-world tasks, the Gemma 4 model performs very well. Why does this discrepancy occur?
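For context, here is a minimal sketch of how a Winogrande-style eval typically scores one item: the blank `_` in the sentence is filled with each candidate option, each filled sentence is scored by the model's log-likelihood, and the higher-scoring option is the prediction. This is an assumption about the general scoring scheme, not llama-perplexity's actual code; `logprob_fn` and the toy "LM" below are hypothetical stand-ins for illustration. A model near 50% on this binary choice is effectively guessing, which is why the contrast with strong real-world performance is surprising.

```python
def score_sentence(sentence, logprob_fn):
    """Sum per-token log-probs; logprob_fn stands in for a real LM."""
    return sum(logprob_fn(tok) for tok in sentence.split())

def winogrande_predict(sentence, option1, option2, logprob_fn):
    """Fill the blank with each option and pick the higher-likelihood one."""
    s1 = score_sentence(sentence.replace("_", option1), logprob_fn)
    s2 = score_sentence(sentence.replace("_", option2), logprob_fn)
    return 1 if s1 >= s2 else 2

# Toy "LM" that happens to favor the token "trophy" (illustration only).
toy_lm = lambda tok: 0.0 if tok == "trophy" else -1.0

item = "The _ doesn't fit in the suitcase because it is too big."
print(winogrande_predict(item, "trophy", "suitcase", toy_lm))  # -> 1
```

Accuracy over the whole CSV is then just the fraction of items where the predicted option matches the labeled answer.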
