Rate-Distortion Optimization for Transformer Inference
arXiv:2601.22002v2
Abstract: Transformers achieve superior performance on many tasks, but impose heavy compute and memory requirements during inference. This inference can be made more efficient by partitioning the process acros…