OpenClaw + oMLX shows 0 cached tokens, but Hermes uses cache fine with the same local model, what am I missing?

Hey everyone, I’m trying to debug a weird prompt cache issue with OpenClaw + oMLX, and I’d appreciate help from anyone running local agents on MLX/oMLX.

The short version is this:

I’m running oMLX v0.3.8 on my Mac, serving:

Qwen3.6-35B-A3B-RotorQuant-MLX-4bit

OpenClaw runs in Docker on my NAS and connects to oMLX through Tailscale / Docker extra host:

http://cerebro:8080/v1

Hermes WebUI / Hermes Agent also use the same oMLX server and the same model, and caching works fine there. So I don’t think this is simply “Qwen can’t cache” or “oMLX cache is broken”.

But when OpenClaw uses the model, oMLX shows:

Cached Tokens: 0
Cache Efficiency: 0.0%
Total Prefill Tokens keeps increasing
Runtime Cache Observability shows cache files, about 16 GB+

So oMLX clearly has cache files on disk, but OpenClaw requests don’t seem to be getting any cache reuse at all.

I already tested oMLX directly with repeated identical requests to /v1/chat/completions, and cache works. Example:

Request 1: prompt_tokens: 63020, cached_tokens: 14336
Request 2: prompt_tokens: 63020, cached_tokens: 61440
Request 3: prompt_tokens: 63020, cached_tokens: 61440
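The test itself was nothing fancy, basically three identical calls in a row, roughly like this (a sketch, not my exact script; big_system_prompt.txt is just a stand-in for a long fixed prompt, and the exact usage field names oMLX returns may differ from what I assume here):

import json
import urllib.request

BASE_URL = "http://cerebro:8080/v1"   # same endpoint OpenClaw points at
MODEL = "local_model"

# A long, fixed prompt so there is actually something worth caching.
messages = [
    {"role": "system", "content": open("big_system_prompt.txt").read()},
    {"role": "user", "content": "Summarize the project status."},
]

def chat_once():
    body = json.dumps({
        "model": MODEL,
        "messages": messages,
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer 1234"},
    )
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp).get("usage", {})
    # oMLX reports cached tokens in the usage block; OpenAI-style servers
    # usually nest it under prompt_tokens_details, so check both.
    details = usage.get("prompt_tokens_details") or {}
    cached = usage.get("cached_tokens", details.get("cached_tokens"))
    print("prompt_tokens:", usage.get("prompt_tokens"), "cached_tokens:", cached)

for _ in range(3):
    chat_once()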

So direct oMLX caching works. Hermes also seems to benefit, at around 93% cache efficiency. OpenClaw is the one that keeps re-prefilling.

My OpenClaw provider config currently looks like this, simplified and redacted:

"models": { "mode": "merge", "providers": { "omlx": { "baseUrl": "http://cerebro-mac:8080/v1", "apiKey": "1234", "api": "openai-completions", "timeoutSeconds": 140000, "models": [ { "id": "local_model", "name": "oMLX local_model", "reasoning": true, "input": ["text"], "contextWindow": 260000, "maxTokens": 32768, "compat": { "supportsPromptCacheKey": true }, "params": { "cacheRetention": "long" } } ] } } } 

And under agents.defaults I have:

"model": { "primary": "omlx/local_model", "fallbacks": [] }, "contextInjection": "continuation-skip", "params": { "cacheRetention": "long" }, "contextPruning": { "mode": "cache-ttl", "ttl": "120m" } 

I also tried openai-responses briefly, but I’m not sure whether oMLX wants:

http://cerebro:8080/v1 

or:

http://cerebro:8080 

for Responses-style calls. The OpenClaw docs mention sending prompt_cache_key to OpenAI-compatible providers when compat.supportsPromptCacheKey is set, but I’m not sure whether OpenClaw is actually sending it to oMLX in my setup.
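The most direct way I can think of to settle that is a tiny logging proxy between OpenClaw and oMLX: point OpenClaw’s baseUrl at it and watch what actually goes over the wire. Rough stdlib-only sketch (the upstream host/port are from my setup; everything else is illustrative):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

UPSTREAM = "http://cerebro:8080"   # real oMLX server
LISTEN_PORT = 8081                 # point OpenClaw's baseUrl at http://<this-host>:8081/v1

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            payload = json.loads(body)
            # The interesting bits: does OpenClaw send prompt_cache_key,
            # and does the start of the prompt change between requests?
            print("path:", self.path)
            print("prompt_cache_key:", payload.get("prompt_cache_key"))
            print("first message head:",
                  str(payload.get("messages", [{}])[0])[:200])
        except json.JSONDecodeError:
            print("non-JSON body,", len(body), "bytes")

        # Forward to oMLX unchanged and relay the response.
        # Note: this buffers the whole response, so it won't play nicely
        # with streaming; it's only meant for a quick one-off check.
        upstream = Request(UPSTREAM + self.path, data=body,
                           headers={"Content-Type": "application/json",
                                    "Authorization": self.headers.get("Authorization", "")},
                           method="POST")
        with urlopen(upstream) as resp:
            data = resp.read()
            self.send_response(resp.status)
            self.send_header("Content-Type", resp.headers.get("Content-Type", "application/json"))
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

HTTPServer(("0.0.0.0", LISTEN_PORT), LoggingProxy).serve_forever()

If prompt_cache_key never shows up, or the beginning of the system prompt changes on every request, that would go a long way toward explaining the 0% hit rate.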

Things I found while researching:

  • OpenClaw has docs for cacheRetention, contextPruning.mode: "cache-ttl", and compat.supportsPromptCacheKey.
  • There was an OpenClaw issue saying 2026.2.15 broke prompt cache for local providers like LM Studio / MLX / llama-server, apparently fixed later by moving volatile IDs out of the system prompt (the prefix-diff sketch after this list is a quick way to check whether something similar is happening here).
  • mlx-lm has an issue about Qwen3.5 caching, hybrid/SSM layers, thinking tokens, and tool rendering causing full prompt reprocessing.
  • But again, direct oMLX and Hermes cache perfectly fine for me. OpenClaw is the outlier.
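Since prefix caching only reuses the longest matching prefix, even a single changing token near the top of the prompt forces a full re-prefill from that point on. A quick way to check whether that’s what OpenClaw is doing is to capture two consecutive request bodies (e.g. with the proxy above) and see where they first diverge. Sketch; req1.json / req2.json are just whatever you saved:

import json
from os.path import commonprefix

def flatten(path):
    """Serialize the captured messages array the same way for both requests."""
    with open(path) as f:
        payload = json.load(f)
    return json.dumps(payload.get("messages", []), ensure_ascii=False)

a, b = flatten("req1.json"), flatten("req2.json")
shared = len(commonprefix([a, b]))
print(f"shared prefix: {shared} of {len(a)} / {len(b)} chars")
print("divergence context:", a[max(0, shared - 80): shared + 80])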

I’m not looking to change models yet, because Hermes works fine with cache on the same oMLX server. I want to understand what OpenClaw is doing differently and how to configure or patch it correctly.

Any help would be appreciated, especially from anyone using:

  • OpenClaw + oMLX
  • OpenClaw + LM Studio MLX
  • OpenClaw + Qwen3.5/Qwen3.6
  • OpenClaw local model providers with prompt caching

Happy to share sanitized config/logs if needed!

submitted by /u/juaps