cs.CL, cs.LG

Continuous Semantic Caching for Low-Cost LLM Serving

arXiv:2604.20021v1 Announce Type: new
Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference cos…
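The caching strategy the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes a hypothetical `embed` function (here a toy bag-of-characters stand-in for a real sentence-embedding model) and a simple cosine-similarity threshold to decide whether a cached response can be reused for a new query.

```python
import math

def embed(text):
    # Hypothetical stand-in for a real sentence-embedding model:
    # a bag-of-characters vector, just to keep the sketch runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached LLM response when a new query is similar enough
    to a previously answered one, avoiding a fresh inference call."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_response)

    def get(self, query):
        # Find the most similar cached query; hit only above the threshold.
        q = embed(query)
        best_resp, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_resp, best_sim = resp, sim
        return best_resp if best_sim >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))
```

A production system would use a learned embedding model and an approximate nearest-neighbor index rather than a linear scan, but the cache-hit logic is the same: embed, search, and compare against a similarity threshold.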