cs.CV

Reinforcing Consistency in Video MLLMs with Structured Rewards

arXiv:2604.01460v1 Announce Type: new
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may f…