Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
arXiv:2510.19592v2 Announce Type: replace
Abstract: Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free m…
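As a rough illustration of the idea that an MLLM's attention from textual query tokens to visual tokens carries localization signal, the sketch below turns one layer's attention weights into per-frame heat maps. It is a generic sketch under stated assumptions, not the paper's method: the function name, the token index layout, the head and token averaging, and the per-frame min-max normalization are all hypothetical choices made for this example.

```python
import numpy as np

def text_to_visual_attention_map(attn, visual_idx, text_idx,
                                 num_frames, grid_h, grid_w):
    """Hypothetical helper: build per-frame heat maps from attention weights.

    attn       : array of shape (heads, seq, seq), rows are query tokens,
                 columns are key tokens (assumed layout).
    visual_idx : positions of the visual tokens in the sequence, ordered
                 frame by frame, row-major within each frame (assumed).
    text_idx   : positions of the textual query tokens in the sequence.
    """
    # Average over attention heads -> (seq, seq).
    mean_attn = attn.mean(axis=0)
    # Keep only text-token rows and visual-token columns.
    sub = mean_attn[np.ix_(text_idx, visual_idx)]
    # Average over text tokens -> one relevance score per visual token.
    scores = sub.mean(axis=0)
    # Lay the scores back out on the (frame, height, width) token grid.
    maps = scores.reshape(num_frames, grid_h, grid_w)
    # Min-max normalize each frame independently to [0, 1].
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + 1e-8)

# Toy usage with random attention: 2 frames of 2x3 visual tokens + 4 text tokens.
rng = np.random.default_rng(0)
num_frames, grid_h, grid_w, num_text, heads = 2, 2, 3, 4, 4
num_visual = num_frames * grid_h * grid_w
seq = num_visual + num_text
logits = rng.normal(size=(heads, seq, seq))
# Softmax over keys so each row is a valid attention distribution.
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
visual_idx = np.arange(num_visual)
text_idx = np.arange(num_visual, seq)
heat = text_to_visual_attention_map(attn, visual_idx, text_idx,
                                    num_frames, grid_h, grid_w)
```

Thresholding such a map would give only a coarse region estimate at the resolution of the visual token grid; how the paper refines or fuses attention is not covered by this sketch.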