Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida

Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

Haruka Kawasaki, Ryota Tanaka, Kyosuke Nishida / April 7, 2026

arXiv:2604.04411v1 Announce Type: cross
Abstract: Visual document understanding (VDU) is a challenging task for large vision language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layou…
