Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
arXiv:2604.04411v1 Announce Type: cross
Abstract: Visual document understanding (VDU) is a challenging task for large vision-language models (LVLMs), requiring the integration of visual perception, text recognition, and reasoning over structured layou…