Continuous Semantic Caching for Low-Cost LLM Serving
arXiv:2604.20021v1 Announce Type: new
Abstract: As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference cost…
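
The full paper is not reproduced in this announcement, but the core mechanism the abstract names is easy to sketch. The snippet below is a minimal illustration of semantic caching in general, not the paper's system: queries are embedded into vectors, and a new query reuses a cached response when its cosine similarity to a previously seen query clears a threshold. The `embed` function (a hash-based stand-in for a real sentence-embedding model) and the `0.9` threshold are placeholder assumptions.

```python
# Minimal semantic-cache sketch (illustrative only, not the paper's method).
# Assumptions: `embed` is a stand-in for any sentence-embedding model, and
# the 0.9 similarity threshold is an arbitrary placeholder.
from dataclasses import dataclass, field

import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder embedding: hashed bag-of-words, unit-normalized.

    A real deployment would use a learned sentence-embedding model instead.
    """
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


@dataclass
class SemanticCache:
    threshold: float = 0.9                          # cosine cutoff for a "hit"
    _keys: list = field(default_factory=list)       # cached query embeddings
    _values: list = field(default_factory=list)     # cached LLM responses

    def lookup(self, query: str):
        """Return a cached response if a semantically similar query exists."""
        if not self._keys:
            return None
        q = embed(query)
        sims = np.stack(self._keys) @ q    # cosine similarity (unit vectors)
        best = int(np.argmax(sims))
        return self._values[best] if sims[best] >= self.threshold else None

    def insert(self, query: str, response: str) -> None:
        self._keys.append(embed(query))
        self._values.append(response)


def serve(cache: SemanticCache, query: str, call_llm) -> str:
    """Answer from the cache on a hit; otherwise call the LLM and cache it."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached                       # cache hit: no inference cost
    response = call_llm(query)              # cache miss: pay for inference
    cache.insert(query, response)
    return response
```

In practice, the linear scan over cached embeddings would be replaced by an approximate nearest-neighbor index, and the threshold tuned to trade hit rate against the risk of returning a response that answers the wrong question.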