AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video
arXiv:2508.03100v4 Announce Type: replace
Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optim…