CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
arXiv:2605.08873v1 Announce Type: cross
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards …
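The sparse-reward failure mode mentioned in the abstract can be illustrated with a minimal sketch of the group-relative advantage as GRPO is commonly described: each sampled completion's reward is standardized against the other completions drawn for the same prompt. The function name `group_relative_advantages`, the `eps` stabilizer, and the toy reward values are assumptions made for illustration. They are not taken from this paper, and the paper's exact normalization may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    Hypothetical helper for illustration: subtracts the group mean and
    divides by the group (population) standard deviation plus eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group where no sampled completion earns any reward (assumed values).
sparse_group = [0.0, 0.0, 0.0, 0.0]

# Toy group where exactly one completion is rewarded (assumed values).
mixed_group = [1.0, 0.0, 0.0, 0.0]

print(group_relative_advantages(sparse_group))
print(group_relative_advantages(mixed_group))
```

When every completion in a group receives the same reward, as in `sparse_group`, all advantages are zero and the group contributes no policy-gradient signal. This is one plausible reading of why sparse rewards hurt small models, which rarely sample a rewarded completion.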