cs.CL, cs.LG

Compute Where it Counts: Self Optimizing Language Models

arXiv:2605.10875v1 Announce Type: cross
Abstract: Research on efficient LLM inference has largely focused on reducing the cost of each decoding step (e.g., through quantization, pruning, or sparse attention), typically applying a uniform computation budget…