CREG: Compass Relational Evidence Graph for Characterizing Directional Structure in VLM Spatial-Reasoning Attribution

arXiv:2603.20475v2 Announce Type: replace

Abstract: Vision-language models (VLMs) can answer spatial relation queries, yet a correct answer does not reveal whether the model truly uses directional evidence or merely exploits object layout. We present CREG (Compass Relational Evidence Graph), a training-free diagnostic framework that converts any token-level attribution map into a reference-parameterized compass distribution and evaluates it with Direction Alignment Error (DAE) and Edge Accuracy (EA). Across three VLMs and two primary benchmarks with native boxes (COCO-Pairs and VG-Spatial), plus supplementary VSR, CREG enables direct comparison of heterogeneous attribution methods on a shared directional scale; Chefer et al. is usually the strongest plug-in, indicating that the framework is not tied to our contrastive Grad-Act signal. Using CREG to probe VLM spatial attribution, we find that attribution is largely layout-driven: changing the queried direction leaves compass outputs near random, and re-centering the projection provides no advantage for the true reference origin. At the same time, CREG detects a limited residual directional component once image identity is controlled. This residual structure is practically useful: lower DAE predicts VLM correctness (AUC up to 0.65) and supports selective prediction and test-time re-ranking, improving accuracy by 14.0 percentage points on COCO-Pairs. CREG provides a unified way to measure directional organization in VLM attribution, making layout bias and residual relational signal explicit and quantifiable.
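To make the core idea concrete, here is a minimal sketch of the kind of computation the abstract describes: projecting a 2-D attribution map into an angular "compass" distribution around a reference origin and scoring it against the queried direction. The function names, bin count, and normalization below are illustrative assumptions, not the paper's actual CREG/DAE definitions.

```python
import numpy as np

def compass_distribution(attr, origin, n_bins=8):
    """Project a 2-D attribution map into an angular histogram around a
    reference origin. Illustrative sketch; not the paper's exact recipe."""
    h, w = attr.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Angle of each pixel relative to the origin (image y-axis points down,
    # so we flip it to get standard counter-clockwise angles).
    angles = np.arctan2(origin[0] - ys, xs - origin[1])
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    idx = np.clip(np.digitize(angles, edges) - 1, 0, n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, idx, attr)  # attribution-weighted angular mass
    total = hist.sum()
    return hist / total if total > 0 else np.full(n_bins, 1.0 / n_bins)

def direction_alignment_error(dist, queried_angle, n_bins=8):
    """Circular distance between the attribution-weighted mean direction
    and the queried direction, normalized to [0, 1] (hypothetical DAE proxy)."""
    centers = np.linspace(-np.pi, np.pi, n_bins, endpoint=False) + np.pi / n_bins
    mean_x = np.sum(dist * np.cos(centers))
    mean_y = np.sum(dist * np.sin(centers))
    mean_angle = np.arctan2(mean_y, mean_x)
    diff = np.arctan2(np.sin(mean_angle - queried_angle),
                      np.cos(mean_angle - queried_angle))
    return np.abs(diff) / np.pi
```

Under this toy definition, attribution mass lying to the right of the origin yields a low error for a queried direction of "right" (angle 0) and a high error for "left" (angle pi), which is the kind of contrast a directional diagnostic needs to detect.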
