Revisiting Multimodal Positional Encoding in Vision-Language Models
October 27, 2025
Authors: Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
cs.AI
Abstract
Multimodal position encoding is essential for vision-language models, yet it
has received little systematic investigation. We conduct a comprehensive
analysis of multimodal Rotary Positional
Embedding (RoPE) by examining its two core components: position design and
frequency allocation. Through extensive experiments, we identify three key
guidelines: positional coherence, full frequency utilization, and preservation
of textual priors, which respectively ensure unambiguous layout, rich representation, and
faithful transfer from the pre-trained LLM. Based on these insights, we propose
Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple and
plug-and-play variants that require no architectural changes. Our methods
consistently outperform existing approaches across diverse benchmarks, with
significant improvements in both general and fine-grained multimodal
understanding. Code will be available at
https://github.com/JJJYmmm/Multimodal-RoPEs.
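
To give intuition for the interleaved frequency-allocation idea behind MRoPE-I, the sketch below assigns each rotary frequency to one of the three position axes (temporal, height, width) in a round-robin pattern and then applies a standard RoPE rotation. This is a minimal PyTorch sketch for illustration only, not the released implementation: the function names (`interleaved_mrope_angles`, `apply_rope`), the `i % 3` interleave pattern, and the convention that text tokens share one coordinate across all three axes are assumptions made for the example.

```python
import torch

def interleaved_mrope_angles(pos_thw: torch.Tensor, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for a 3-axis (t, h, w) RoPE with interleaved frequency allocation.

    pos_thw: (seq_len, 3) integer positions per token along temporal/height/width axes.
             In this sketch, plain text tokens carry the same coordinate on all three
             axes, so their rotation reduces to ordinary 1-D RoPE (textual prior kept).
    Returns: (seq_len, head_dim // 2) rotation angles, one per rotary frequency.
    """
    half = head_dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    # Interleave: frequency index i is driven by axis i % 3 -> t, h, w, t, h, w, ...
    axis_of_freq = torch.arange(half) % 3                                   # (half,)
    pos_per_freq = pos_thw.float()[:, axis_of_freq]                         # (seq, half)
    return pos_per_freq * inv_freq                                          # (seq, half)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate feature pairs (x[..., :half], x[..., half:]) by the given angles. x: (seq, head_dim)."""
    half = x.shape[-1] // 2
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy usage: two text tokens followed by a 2x2 image patch grid at the next "time" step.
text_pos = torch.tensor([[0, 0, 0], [1, 1, 1]])                             # shared coordinate
image_pos = torch.tensor([[2, h, w] for h in range(2) for w in range(2)])   # (t, h, w) per patch
positions = torch.cat([text_pos, image_pos], dim=0)                         # (6, 3)
q = torch.randn(6, 64)
q_rot = apply_rope(q, interleaved_mrope_angles(positions, head_dim=64))
```

Under these assumptions, every frequency band is used for every axis (full frequency utilization), and text-only sequences recover the pre-trained LLM's 1-D RoPE exactly, which is one way to read the "preservation of textual priors" guideline stated in the abstract.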