It solved an issue with a script that pulls real-time data from nvidia-smi; Gemini 3.1 actually failed to fix it, even in a fresh session, lol. It's kind of mind-blowing that in 2026 we already have stable local models with 200k+ context!

I tested it by feeding it as many Reddit posts, random documentation files, and raw files from the llama.cpp repo as possible, to push usage up and see how it affects my VRAM. Even during this testing, Gemma kept its mind intact! At 245,283 / 262,144 (94%) context, if I ask it what a specific user said, it matches the quote perfectly and answers within 2-5 seconds.

From previous tests, I found I had to decrease the temperature and bump the repeat penalty to 1.17/1.18 so it doesn't fall into a loop of self-questioning. Above 100k context, it used to start looping through its own thoughts and arguing with itself; instead of giving a final answer, it would just go on forever. These settings helped a lot!

I'm using the latest llama.cpp (which gets updates almost every hour) and the latest Unsloth GGUF from 2-6 hours ago, so make sure to redownload!

Model: gemma-4-26B-A4B-it-UD-IQ4_NL.gguf, unsloth (unsloth bis)

What else can I test? Honestly, I ran out of ideas to crash it! It just gulps and gulps whatever I throw at it.
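For anyone curious what "a script that pulls real-time data from nvidia-smi" might look like, here's a minimal sketch of that kind of monitor. This is my own illustration, not OP's actual script: the function names (`parse_gpu_line`, `poll_nvidia_smi`) and the one-second interval are assumptions; the `nvidia-smi` query flags themselves are standard.

```python
import subprocess
import time

def parse_gpu_line(line):
    # Parse one CSV line produced by the nvidia-smi query below,
    # e.g. "12345, 87" -> VRAM used (MiB) and GPU utilization (%).
    used_mib, util_pct = (int(x) for x in line.split(","))
    return {"vram_used_mib": used_mib, "gpu_util_pct": util_pct}

def poll_nvidia_smi(interval_s=1.0, n_samples=5):
    # Query VRAM usage and GPU utilization a fixed number of times.
    for _ in range(n_samples):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,utilization.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        for i, line in enumerate(out.splitlines()):
            stats = parse_gpu_line(line)
            print(f"GPU{i}: {stats['vram_used_mib']} MiB used, "
                  f"{stats['gpu_util_pct']}% util")
        time.sleep(interval_s)
```

Call `poll_nvidia_smi()` on a machine with an NVIDIA GPU to watch VRAM climb as the context fills.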
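And here's roughly how those sampler settings translate into a llama.cpp launch. This is a sketch, not OP's exact command: the model path and port are placeholders, and the temperature value is my guess since the post only says "decrease the temperature"; the repeat penalty and context size are from the post.

```shell
# Sketch of a llama-server launch with the post's anti-looping settings.
# --temp 0.6 is an assumed value; 1.17 repeat penalty and 262144 context
# come from the post.
./llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_NL.gguf \
  -c 262144 \
  --temp 0.6 \
  --repeat-penalty 1.17 \
  --port 8080
```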