cs.AI, cs.CV

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

arXiv:2604.12537v1 Announce Type: cross
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional …