cs.CV

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

arXiv:2510.19592v2 Announce Type: replace
Abstract: Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free m…