cs.AR, cs.CL, cs.LG

Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

arXiv:2605.09490v1 Announce Type: new
Abstract: Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response — permanently evicting low-importance tokens — is catastrophic for reason…
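To make the critiqued baseline concrete, here is a minimal, illustrative sketch (not the paper's method) of importance-based permanent eviction: once the KV cache exceeds its budget, the lowest-importance token's entry is dropped outright and can never be recalled by later reasoning steps. All names (`EvictingKVCache`, the importance scores) are hypothetical.

```python
# Illustrative sketch of the "permanent eviction" baseline the abstract
# critiques: low-importance KV entries are dropped once the cache exceeds
# its budget, so they are unrecoverable for later reasoning steps.
from dataclasses import dataclass, field

@dataclass
class EvictingKVCache:
    budget: int  # max tokens kept resident (stand-in for scarce HBM)
    entries: list = field(default_factory=list)  # (token_id, importance, kv)

    def append(self, token_id, importance, kv):
        self.entries.append((token_id, importance, kv))
        if len(self.entries) > self.budget:
            # Permanently evict the lowest-importance token.
            victim = min(range(len(self.entries)),
                         key=lambda i: self.entries[i][1])
            del self.entries[victim]

    def resident_tokens(self):
        return [t for t, _, _ in self.entries]

cache = EvictingKVCache(budget=3)
for tok, imp in [(0, 0.9), (1, 0.1), (2, 0.8), (3, 0.7)]:
    cache.append(tok, imp, kv=None)
print(cache.resident_tokens())  # token 1 (importance 0.1) was evicted
```

The failure mode the abstract points at follows directly: if a "low-importance" token later turns out to matter for the chain of thought, its KV pair is gone; a memory hierarchy would instead demote it to cheaper storage rather than delete it.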