One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving
arXiv:2605.04450v1 Announce Type: cross
Abstract: Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the o…