Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL
arXiv:2604.28123v2 Announce Type: replace-cross
Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (R…