cs.AI, cs.CL

The Diminishing Returns of Early-Exit Decoding in Modern LLMs

arXiv:2603.23701v1 Announce Type: cross
Abstract: In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, re…