Hey everyone, I’m trying to debug a weird prompt cache issue with OpenClaw + oMLX, and I’d appreciate help from anyone running local agents on MLX/oMLX.
The short version is this:
I’m running oMLX v0.3.8 on my Mac, serving:
Qwen3.6-35B-A3B-RotorQuant-MLX-4bit
OpenClaw runs in Docker on my NAS and connects to oMLX through Tailscale / a Docker extra host.
Hermes WebUI / Hermes Agent also use the same oMLX server and the same model, and the cache works fine there. So I don’t think this is simply “Qwen can’t cache” or “oMLX cache is broken”.
But when OpenClaw uses the model, oMLX shows:
- Cached Tokens: 0
- Cache Efficiency: 0.0%
- Total Prefill Tokens keeps increasing
- Runtime Cache Observability shows cache files, about 16 GB+

So oMLX clearly has cache files, but OpenClaw requests seem to be missing cache reuse completely.
I already tested oMLX directly with repeated identical requests to /v1/chat/completions, and cache works. Example:
- Request 1: prompt_tokens: 63020, cached_tokens: 14336
- Request 2: prompt_tokens: 63020, cached_tokens: 61440
- Request 3: prompt_tokens: 63020, cached_tokens: 61440

So direct oMLX caching works, and Hermes also benefits from the cache at about 93%. OpenClaw is the one that keeps re-prefilling.
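For anyone who wants to reproduce the direct test, here's roughly what I'm doing, sketched in Python. The hostname, API key, and model id come from my config below; the `prompt_tokens_details.cached_tokens` path is the OpenAI usage format, which I'm assuming oMLX mirrors:

```python
import json
import urllib.request

# assumptions: host/port and key from my setup; adjust for yours
OMLX_URL = "http://cerebro-mac:8080/v1/chat/completions"
API_KEY = "1234"

def build_payload(messages, cache_key=None):
    """Build an OpenAI-style chat payload; prompt_cache_key is optional."""
    body = {"model": "local_model", "messages": messages, "max_tokens": 1}
    if cache_key:
        body["prompt_cache_key"] = cache_key
    return body

def cached_tokens(resp_json):
    """Extract the cached token count from an OpenAI-style usage block."""
    usage = resp_json.get("usage", {})
    details = usage.get("prompt_tokens_details", {}) or {}
    return details.get("cached_tokens", usage.get("cached_tokens", 0))

def probe(messages):
    """Send one request and return how many prompt tokens hit the cache."""
    data = json.dumps(build_payload(messages)).encode()
    req = urllib.request.Request(
        OMLX_URL, data=data,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"})
    with urllib.request.urlopen(req) as r:
        return cached_tokens(json.load(r))
```

Calling `probe` twice with the identical long message list is what produced the numbers above: the first call pays most of the prefill, the second one should report a large `cached_tokens`.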
My OpenClaw provider config currently looks like this, simplified and redacted:
"models": { "mode": "merge", "providers": { "omlx": { "baseUrl": "http://cerebro-mac:8080/v1", "apiKey": "1234", "api": "openai-completions", "timeoutSeconds": 140000, "models": [ { "id": "local_model", "name": "oMLX local_model", "reasoning": true, "input": ["text"], "contextWindow": 260000, "maxTokens": 32768, "compat": { "supportsPromptCacheKey": true }, "params": { "cacheRetention": "long" } } ] } } } And under agents.defaults I have:
"model": { "primary": "omlx/local_model", "fallbacks": [] }, "contextInjection": "continuation-skip", "params": { "cacheRetention": "long" }, "contextPruning": { "mode": "cache-ttl", "ttl": "120m" } I also tried openai-responses briefly, but I’m not sure whether oMLX wants:
`http://cerebro:8080/v1`

or:

`http://cerebro:8080`

for Responses-style calls. OpenClaw docs mention `prompt_cache_key` for OpenAI-compatible providers when `compat.supportsPromptCacheKey` is set, but I'm not sure whether OpenClaw is actually sending it to oMLX in my setup.
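One way I plan to check is to capture a raw OpenClaw request body (from oMLX's request log, or a tcpdump/mitmproxy capture) and inspect it. A minimal sketch, assuming the standard OpenAI chat schema (`audit_request` is just a name I made up):

```python
import json

def audit_request(raw_body: bytes) -> dict:
    """Summarize cache-relevant fields in a captured OpenAI-style chat request."""
    body = json.loads(raw_body)
    system = [m for m in body.get("messages", []) if m.get("role") == "system"]
    return {
        # is OpenClaw actually forwarding prompt_cache_key?
        "has_prompt_cache_key": "prompt_cache_key" in body,
        "prompt_cache_key": body.get("prompt_cache_key"),
        # big system prompts are where volatile IDs would hurt the most
        "system_prompt_chars": sum(len(str(m.get("content", ""))) for m in system),
    }

# example: a request that never sets the key reports has_prompt_cache_key=False
sample = json.dumps({"model": "local_model",
                     "messages": [{"role": "system", "content": "You are an agent."}]})
print(audit_request(sample.encode()))
```

If the key never shows up in captured traffic, the `compat.supportsPromptCacheKey` flag isn't taking effect for this provider, which would narrow things down a lot.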
Things I found while researching:
- OpenClaw has docs for `cacheRetention`, `contextPruning.mode: "cache-ttl"`, and `compat.supportsPromptCacheKey`.
- There was an OpenClaw issue saying 2026.2.15 broke prompt cache for local providers like LM Studio / MLX / llama-server, apparently fixed later by moving volatile IDs out of the system prompt.
- `mlx-lm` has an issue about Qwen3.5 caching, hybrid/SSM layers, thinking tokens, and tool rendering causing full prompt reprocessing.
- But again, direct oMLX and Hermes cache perfectly fine for me. OpenClaw is the outlier.
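Given that volatile-ID issue, my next step is to diff two consecutive OpenClaw request bodies: a prefix cache can only reuse the longest shared prefix, so a timestamp or session ID near the top of the system prompt would explain `cached_tokens: 0` even when the conversation is otherwise identical. A rough sketch (function names are mine):

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two serialized requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def divergence_report(req1: str, req2: str, context: int = 40) -> str:
    """Show where two requests first differ; the diverging snippet often
    exposes a volatile ID or timestamp that breaks prefix-cache reuse."""
    n = shared_prefix_len(req1, req2)
    if n == len(req1) == len(req2):
        return "requests identical; cache miss is not prompt-side"
    return f"diverge at char {n}: ...{req1[max(0, n - context):n + context]!r}"

# e.g. two system prompts differing only in a session id diverge almost immediately
r1 = '{"messages":[{"role":"system","content":"session=a1b2 You are..."}]}'
r2 = '{"messages":[{"role":"system","content":"session=9f8e You are..."}]}'
print(divergence_report(r1, r2))
```

If the divergence point is early in the body, almost nothing is cacheable no matter what oMLX does; if the requests are byte-identical and the cache still misses, the problem is on the oMLX/routing side instead.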
I’m not looking to change models yet, because Hermes works fine with cache on the same oMLX server. I want to understand what OpenClaw is doing differently and how to configure or patch it correctly.
Any help would be appreciated, especially from anyone using:
- OpenClaw + oMLX
- OpenClaw + LM Studio MLX
- OpenClaw + Qwen3.5 / Qwen3.6
- OpenClaw local model providers with prompt caching

Happy to share sanitized config/logs if needed!