Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
arXiv:2603.26737v1 Announce Type: new
Abstract: Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is se…