cs.CV

MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

arXiv:2604.06376v1 Announce Type: new
Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrati…