Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
arXiv:2605.09490v1 Announce Type: new
Abstract: Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response, permanently evicting low-importance tokens, is catastrophic for reason…
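The eviction strategy the abstract criticizes can be sketched in a few lines. This is a minimal toy illustration, not the paper's method: the NumPy cache layout, the function name `evict_low_importance`, the use of cumulative attention as the importance signal, and the `keep_ratio` parameter are all assumptions for the sketch.

```python
import numpy as np

def evict_low_importance(kv_cache, attn_scores, keep_ratio=0.5):
    """Permanently drop the least-attended tokens from a KV cache.

    kv_cache:    (seq_len, d) array of cached key/value vectors (toy layout).
    attn_scores: (seq_len,) cumulative attention each token has received,
                 used here as a stand-in importance signal.
    Returns the pruned cache and the indices of the kept tokens.
    """
    seq_len = kv_cache.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # Keep the n_keep highest-importance tokens, preserving sequence order.
    keep = np.sort(np.argsort(attn_scores)[-n_keep:])
    return kv_cache[keep], keep

cache = np.arange(16, dtype=float).reshape(8, 2)
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
pruned, kept = evict_low_importance(cache, scores, keep_ratio=0.5)
```

Because the evicted entries are gone for good, any later reasoning step that needs to attend back to a dropped token cannot recover it, which is the failure mode the abstract calls catastrophic.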