cs.AI, cs.AR, cs.LG

FASQ: Flexible Accelerated Subspace Quantization for Calibration-Free LLM Compression

arXiv:2605.04084v1 Announce Type: new
Abstract: Compressing large language models (LLMs) for deployment on commodity GPUs remains challenging: conventional scalar quantization is limited to fixed bit-widths (e.g., 8/4/3-bit), offers only a few discret…
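The fixed-bit-width baseline the abstract contrasts against can be illustrated with a minimal sketch of uniform symmetric scalar quantization. This is not the FASQ method itself (the abstract is truncated before it is described); it only shows why conventional scalar quantization yields just a few discrete operating points — one per integer bit-width such as 8, 4, or 3.

```python
import numpy as np

def scalar_quantize(w, bits):
    """Uniform symmetric per-tensor quantization to a fixed integer bit-width."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit, 7 for 4-bit, 3 for 3-bit
    scale = np.abs(w).max() / qmax        # single per-tensor scale factor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

# Each integer bit-width is one discrete accuracy/size operating point.
w = np.random.randn(256).astype(np.float32)
for bits in (8, 4, 3):
    q, s = scalar_quantize(w, bits)
    max_err = np.abs(dequantize(q, s) - w).max()
```

Because `bits` can only take integer values, the compression-ratio/accuracy trade-off moves in coarse jumps, which is the limitation the abstract attributes to conventional scalar quantization.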