MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
arXiv:2604.12537v1 Announce Type: cross
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional …