LVRPO: Language-Visual Alignment with GRPO for Multimodal Understanding and Generation
arXiv:2603.27693v1 Announce Type: cross
Abstract: Unified multimodal pretraining has emerged as a promising paradigm for jointly modeling language and vision within a single foundation model. However, existing approaches largely rely on implicit or in…