Revisiting Multimodal Positional Encoding in Vision-Language Models
October 27, 2025
Authors: Jie Huang, Xuejing Liu, Sibo Song, Ruibing Hou, Hong Chang, Junyang Lin, Shuai Bai
cs.AI
Abstract
Multimodal position encoding is essential for vision-language models, yet it has received little systematic investigation. We conduct a comprehensive analysis of multimodal Rotary Positional Embedding (RoPE) by examining its two core components: position design and frequency allocation. Through extensive experiments, we identify three key guidelines: positional coherence, full frequency utilization, and preservation of textual priors, which together ensure unambiguous layout, rich representation, and faithful transfer from the pre-trained LLM. Based on these insights, we propose Multi-Head RoPE (MHRoPE) and MRoPE-Interleave (MRoPE-I), two simple, plug-and-play variants that require no architectural changes. Our methods consistently outperform existing approaches across diverse benchmarks, with significant improvements in both general and fine-grained multimodal understanding. Code will be available at https://github.com/JJJYmmm/Multimodal-RoPEs.
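
To make the frequency-allocation question concrete, the following minimal PyTorch sketch contrasts two ways of assigning RoPE frequency dimensions to the temporal/height/width axes of visual tokens: contiguous per-axis chunks (as in Qwen2-VL-style MRoPE) versus an interleaved assignment in the spirit of MRoPE-I, where every axis spans the full frequency spectrum. The function names, shapes, and exact allocation scheme are illustrative assumptions, not the authors' released implementation (see the repository linked above for the official code).

import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies, one per rotation pair (head_dim // 2 of them).
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def axis_assignment(num_pairs: int, interleave: bool) -> torch.Tensor:
    # Map each frequency pair to an axis id: 0 = temporal, 1 = height, 2 = width.
    if interleave:
        # t, h, w, t, h, w, ... -> each axis touches both high and low frequencies.
        return torch.arange(num_pairs) % 3
    # Chunked: first block of pairs -> t, second -> h, last -> w.
    return torch.repeat_interleave(torch.arange(3), (num_pairs + 2) // 3)[:num_pairs]

def multimodal_rope_angles(pos_thw: torch.Tensor, head_dim: int, interleave: bool) -> torch.Tensor:
    # pos_thw: (seq_len, 3) integer (t, h, w) positions per token.
    # Returns rotation angles of shape (seq_len, head_dim // 2).
    inv_freq = rope_inv_freq(head_dim)                      # (head_dim // 2,)
    axis = axis_assignment(inv_freq.numel(), interleave)    # (head_dim // 2,)
    pos_per_pair = pos_thw[:, axis].float()                 # pick t/h/w coordinate per pair
    return pos_per_pair * inv_freq                          # broadcast over frequency pairs

# Example: a 2x2 image patch grid at t=0, followed by one text token with shared coordinates.
positions = torch.tensor([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 1, 1]])
print(multimodal_rope_angles(positions, head_dim=16, interleave=False).shape)  # (5, 8) chunked
print(multimodal_rope_angles(positions, head_dim=16, interleave=True).shape)   # (5, 8) interleaved

Both variants keep the standard RoPE machinery untouched and only change how the (t, h, w) coordinates are routed to frequency dimensions, which is why such schemes can be applied plug-and-play without architectural changes.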