Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
arXiv:2605.08913v2 Announce Type: replace
Abstract: Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-…