cs.AI, cs.CV

Cross-Modal Attention Analysis and Optimization in Vision-Language Models: A Study on Visual Reliability

arXiv:2604.17217v1 Announce Type: new
Abstract: Vision-Language Models (VLMs) achieve strong cross-modal performance, yet recent evidence suggests they over-rely on textual descriptions while under-utilizing visual evidence — a phenomenon termed “te…