Ok so, I will try to explain myself as thoroughly as possible, because I really cannot find much about this online.
Let's start with my settings for running Qwen 3.6 35B:

Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinking": true}' --host 0.0.0.0 -m "/X/Qwen3.6-35B-A3B-Q6_K-00001-of-00002.gguf" --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 48 --fit on -t 16 --fit-ctx 230000 --fit-target 256 --temp 0.7 --min-p 0.0 --top-p 0.95 --top-k 20 --jinja --no-mmproj --no-mmap -np 1 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-file "/X/qwen3.6.jinja" -ub 4096 -b 8192'

And this is my setup:
AMD 5800X
96GB DDR4 3333 MHz
RX 6800XT 16GB
Ubuntu 26.04 running locally compiled llama.cpp with ROCm 7.2.2
Qwen 3.6 35B is THE model that finally lets me use local AI in a professional setting, because it works very well with pi or opencode and it's plenty fast for me! (1000+ tps on prompt processing, 15 to 22 tps on token generation).
That is, at least until I fill up my context, which sadly happens very, very often.
One issue I noticed with ALL coding agents, be it kilo, opencode, or pi, is that NONE of them can do context compaction without triggering a full prompt reprocessing and invalidating the entire KV cache. Even at 1000+ tps, that is still a LOT of time to wait for 200k+ tokens worth of context to compact.
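In case it helps to picture why this happens: as far as I understand it, llama.cpp's server only reuses the KV cache for the longest token prefix shared between the incoming prompt and what it already has cached. Compaction rewrites the conversation near the top, so that shared prefix collapses to almost nothing. A minimal Python sketch of the effect (made-up token IDs, not real tokenizer output):

```python
def shared_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """How many leading tokens the server can reuse from the KV cache."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Normal agent turn: the new prompt only appends to the old one, so the
# cached tokens are all reusable and only the tail gets processed.
cached = [1, 2, 3, 4, 5, 6, 7, 8]
appended = cached + [9, 10]
print(shared_prefix_len(cached, appended))  # 8 -> only 2 new tokens to process

# Compaction: old turns are replaced by a summary near the top of the
# prompt, the token streams diverge almost immediately, and virtually
# the whole context has to be reprocessed from scratch.
compacted = [1, 2, 99, 100, 101]
print(shared_prefix_len(cached, compacted))  # 2 -> near-full reprocess
```

So unless the compacted prompt happens to stay byte-identical to a prefix of the old one (which it basically never does), the cache is gone.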
So, what am I missing? Have you also had this issue? If so, how did you solve it?
Hope this will bring out a solution to this obscure issue!