prompt caching, but for rl training – 7.5x speedup on long-prompt/short-response workloads

most open-source RL engines pack sequences naively: prompt + response, repeated for every sample in the group. this is fine for short-prompt, long-completion workloads but wasteful for long-prompt, short-completion ones. with 1000-token prompts, 100-token responses, and a group size of G=8, you're processing 8800 tokens when only 1800 are unique: nearly 5x more compute than necessary.
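
to make the arithmetic concrete, a quick back-of-the-envelope check (a sketch with the numbers from the example above; the variable names are mine):

```python
# naive packing recomputes the prompt for every sample in the group;
# shared-prefix packing computes it once
prompt_len, resp_len, G = 1000, 100, 8

naive = G * (prompt_len + resp_len)   # 8 * 1100 = 8800 tokens
shared = prompt_len + G * resp_len    # 1000 + 800 = 1800 tokens

print(f"{naive} vs {shared} tokens, {naive / shared:.1f}x overhead")  # ~4.9x
```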

the fix is conceptually simple: compute the prompt once, then compute all G responses after it. it's analogous to prefix caching at inference time, except training needs gradients to flow back through the prompt, which breaks causal attention in the obvious implementation. getting it right required different tricks for full vs. linear attention layers.
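
for the full-attention layers, one way to picture the trick is a block attention mask over a packed [prompt, resp_1, ..., resp_G] sequence: each response attends causally to itself plus the full prompt, and never to the other responses. here's a minimal sketch of such a mask (my own illustration, assuming equal-length responses for simplicity; not the engine's actual implementation, and the linear-attention side isn't shown):

```python
import torch

def shared_prefix_mask(prompt_len: int, resp_len: int, G: int) -> torch.Tensor:
    """Boolean mask for a packed [prompt, resp_1, ..., resp_G] sequence.

    True = may attend. Each response is causal over itself plus the shared
    prompt; responses never see each other.
    """
    total = prompt_len + G * resp_len
    mask = torch.zeros(total, total, dtype=torch.bool)

    # prompt tokens: ordinary causal attention within the prompt
    mask[:prompt_len, :prompt_len] = torch.tril(
        torch.ones(prompt_len, prompt_len, dtype=torch.bool)
    )

    for g in range(G):
        start = prompt_len + g * resp_len
        end = start + resp_len
        # every response token sees the full prompt...
        mask[start:end, :prompt_len] = True
        # ...and is causal within its own response only
        mask[start:end, start:end] = torch.tril(
            torch.ones(resp_len, resp_len, dtype=torch.bool)
        )
    return mask
```

with a mask like this, each response's attention pattern matches what it would see if packed naively after its own copy of the prompt, so one forward/backward pass over prompt_len + G*resp_len tokens covers the whole group and gradients still flow through the shared prompt.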

you can read about it in the blog post linked in the comments.

numbers on Qwen3.5-4B (prompt length / response length → speedup):

- 16k / 64 → 7.5x
- 16k / 128 → 7.3x
- 16k / 1k → 5.4x
- 8k / 4k → 1.7x

submitted by /u/girishkumama
