cs.CV

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

arXiv:2602.04476v2 Announce Type: replace
Abstract: Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems that require extensive multi-step reasoning. This is …