UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
arXiv:2605.11856v1 Announce Type: cross
Abstract: Multimodal large language models are increasingly expected to "think with images," yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with…