

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

May 26, 2025
作者: Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
cs.AI

Abstract

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
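To make the frequency-allocation idea concrete, below is a minimal, hypothetical sketch of how a multimodal RoPE might split one attention head's rotary frequencies across temporal and spatial (t, x, y) token positions and rescale the temporal axis at inference time. This is not the authors' released implementation (see the linked GitHub repository for that): the split ratios, the `temporal_scale` factor, and all function names are illustrative assumptions.

```python
# Illustrative sketch (not HoPE's actual implementation): split one head's
# rotary frequency pairs into temporal / spatial-x / spatial-y sub-bands and
# apply a scaling factor to the temporal axis. Split ratios and the
# `temporal_scale` parameter are hypothetical choices for illustration.

import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE angles: outer product of positions and inverse frequencies."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.float(), inv_freq)           # (seq_len, dim // 2)

def multimodal_rope_angles(t_pos, x_pos, y_pos, head_dim: int,
                           temporal_scale: float = 1.0) -> torch.Tensor:
    """Allocate frequency sub-bands of one attention head to (t, x, y) positions.

    `temporal_scale` stretches or compresses temporal indices, loosely mirroring
    the idea of rescaling the time axis for different context lengths.
    """
    # Hypothetical split: half of the rotary pairs for time, a quarter each for x / y.
    t_dim = head_dim // 2
    s_dim = head_dim // 4
    ang_t = rope_angles(t_pos * temporal_scale, t_dim)
    ang_x = rope_angles(x_pos, s_dim)
    ang_y = rope_angles(y_pos, s_dim)
    return torch.cat([ang_t, ang_x, ang_y], dim=-1)           # (seq_len, head_dim // 2)

def apply_rope(q: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features pairwise by the precomputed angles."""
    cos, sin = angles.cos(), angles.sin()
    q1, q2 = q[..., 0::2], q[..., 1::2]
    out = torch.empty_like(q)
    out[..., 0::2] = q1 * cos - q2 * sin
    out[..., 1::2] = q1 * sin + q2 * cos
    return out

# Toy usage: 8 video tokens on a 2x2 spatial grid over 2 frames, head_dim = 64.
t = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
x = torch.tensor([0, 0, 1, 1, 0, 0, 1, 1])
y = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
q = torch.randn(8, 64)
q_rot = apply_rope(q, multimodal_rope_angles(t, x, y, head_dim=64, temporal_scale=1.5))
print(q_rot.shape)  # torch.Size([8, 64])
```

HoPE's actual hybrid allocation and dynamic temporal scaling differ in their specifics; the sketch only shows the general mechanism of assigning rotary frequency sub-bands to different positional axes and scaling temporal positions.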
