Revisiting Multimodal Positional Encoding in Vision-Language Models
arXiv:2510.23095v3 Announce Type: replace
Abstract: Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of mult…