Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
arXiv:2604.02985v1 Announce Type: cross
Abstract: With the wide adoption of language models for IR, and specifically for RAG systems, the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead la…