VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
arXiv:2605.04870v1 Announce Type: new
Abstract: Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over the text that appears in videos. Despite the strong multimodal video understanding capabil…