Yogesh Kulkarni, Pooyan Fazli

AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

Yogesh Kulkarni, Pooyan Fazli / March 31, 2026

arXiv:2508.03100v4
Abstract: Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optim…
