cs.CL, cs.LG

Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

arXiv:2505.05772v2 Announce Type: replace
Abstract: Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure o…