Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

May 7, 2026

arXiv:2605.04870v1
Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabil…
