Wenjun Yu, Shuguang Han, Amelie Chi Zhou

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Wenjun Yu, Shuguang Han, Amelie Chi Zhou / May 7, 2026

arXiv:2605.04450v1 (cross-listed)
Abstract: Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the o…
