cs.CL, cs.CV

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

arXiv:2605.11856v1 Announce Type: cross
Abstract: Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with…