Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models
arXiv:2604.15153v2 Announce Type: replace-cross
Abstract: Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims…
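The truncated abstract does not spell out the paper's exact merging mechanism, but the idea of $K$-token merging can be illustrated with a minimal sketch: mean-pool every group of $K$ consecutive token embeddings so a length-$n$ sequence shrinks to roughly $n/K$ latent vectors before attention. The function name and the mean-pooling choice below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def merge_k_tokens(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Illustrative sketch (not the paper's method): compress a
    (seq_len, dim) embedding sequence by mean-pooling every group
    of k consecutive token embeddings."""
    seq_len, dim = embeddings.shape
    # Pad with zero rows so the length is a multiple of k.
    pad = (-seq_len) % k
    counts = np.ones(seq_len + pad)
    if pad:
        embeddings = np.vstack([embeddings, np.zeros((pad, dim))])
        counts[seq_len:] = 0.0  # padding rows must not dilute the mean
    groups = embeddings.reshape(-1, k, dim)
    group_counts = counts.reshape(-1, k).sum(axis=1, keepdims=True)
    return groups.sum(axis=1) / group_counts

# 6 tokens of dimension 2, merged 3-at-a-time -> 2 latent vectors
seq = np.arange(12, dtype=float).reshape(6, 2)
merged = merge_k_tokens(seq, k=3)
print(merged.shape)  # (2, 2)
```

Since self-attention cost grows with the square of sequence length, merging by a factor of $K$ cuts the attention FLOPs by roughly $K^2$, which is the motivation the abstract gives for compressing in the latent space.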