LocalLLaMA

Unweight: how we compressed an LLM 22% without sacrificing quality

Summary: LLM inference on modern GPUs (like the NVIDIA H100) is bottlenecked by memory bandwidth, not compute. The time it takes to stream model weights from the GPU's main memory (HBM) to its processing cores limits how fas…
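A quick back-of-the-envelope sketch of why bandwidth sets the ceiling, and what a 22% size cut buys. The model size, precision, and bandwidth figures here are illustrative assumptions, not measurements from the post:

```python
# Roofline-style estimate of decode throughput for a bandwidth-bound model.
# Assumed numbers (not from the post):
#   - 7B-parameter model stored in FP16 (2 bytes per weight)
#   - H100 HBM3 bandwidth taken as ~3.35 TB/s
# During autoregressive decoding, each generated token must stream the
# full weight set from HBM, so bandwidth caps tokens/s regardless of FLOPs.

params = 7e9              # model parameters (assumed)
bytes_per_weight = 2      # FP16
bandwidth = 3.35e12       # bytes/s, approximate H100 HBM3

weight_bytes = params * bytes_per_weight      # ~14 GB read per token
max_tokens_per_s = bandwidth / weight_bytes

# A 22% size reduction means 22% fewer bytes moved per token,
# so the ceiling rises by 1 / 0.78 ~= 1.28x.
compressed_ceiling = bandwidth / (weight_bytes * 0.78)

print(f"uncompressed ceiling: {max_tokens_per_s:.0f} tok/s")
print(f"compressed ceiling:   {compressed_ceiling:.0f} tok/s")
```

Under these assumptions the ceiling is roughly 239 tok/s uncompressed versus roughly 307 tok/s at 78% of the original size, which is why shrinking the bytes moved translates almost directly into decode speed.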