HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models
May 26, 2025
Authors: Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu
cs.AI
Abstract
Vision-Language Models (VLMs) have made significant progress in multimodal
tasks. However, their performance often deteriorates in long-context scenarios,
particularly long videos. While Rotary Position Embedding (RoPE) has been
widely adopted for length generalization in Large Language Models (LLMs),
extending vanilla RoPE to capture the intricate spatial-temporal dependencies
in videos remains an unsolved challenge. Existing methods typically allocate
different frequencies within RoPE to encode 3D positional information. However,
these allocation strategies mainly rely on heuristics, lacking in-depth
theoretical analysis. In this paper, we first study how different allocation
strategies impact the long-context capabilities of VLMs. Our analysis reveals
that current multimodal RoPEs fail to reliably capture semantic similarities
over extended contexts. To address this issue, we propose HoPE, a Hybrid of
Position Embedding designed to improve the long-context capabilities of VLMs.
HoPE introduces a hybrid frequency allocation strategy for reliable semantic
modeling over arbitrarily long context, and a dynamic temporal scaling
mechanism to facilitate robust learning and flexible inference across diverse
context lengths. Extensive experiments across four video benchmarks on long
video understanding and retrieval tasks demonstrate that HoPE consistently
outperforms existing methods, confirming its effectiveness. Code is available
at https://github.com/hrlics/HoPE.
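The abstract notes that existing multimodal RoPE variants allocate different rotary frequency bands to temporal and spatial axes, and that HoPE adds a hybrid allocation plus a dynamic temporal scaling mechanism. The sketch below illustrates the general idea of splitting RoPE frequency bands across (t, x, y) positions with a temporal scaling factor; the even three-way split, the function names, and the single `temporal_scale` parameter are assumptions chosen for illustration, not the paper's actual HoPE allocation or scaling rules.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies: one per (even, odd) feature pair.
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def multimodal_rope_angles(t, x, y, head_dim, temporal_scale=1.0):
    """Illustrative 3D frequency allocation (not the paper's exact scheme).

    Splits the frequency bands into three groups and assigns each group to
    one positional axis: temporal (t), horizontal (x), vertical (y).
    `temporal_scale` stands in for a dynamic temporal scaling factor.
    """
    freqs = rope_frequencies(head_dim)          # shape: (head_dim // 2,)
    n = len(freqs)
    angles = np.empty(n)
    third = n // 3
    # Lowest-frequency bands -> temporal axis, scaled for long videos.
    angles[:third] = temporal_scale * t * freqs[:third]
    # Middle bands -> horizontal patch position.
    angles[third:2 * third] = x * freqs[third:2 * third]
    # Highest-frequency bands -> vertical patch position.
    angles[2 * third:] = y * freqs[2 * third:]
    return angles

def apply_rope(q, angles):
    # Rotate consecutive (even, odd) feature pairs of a query/key vector.
    q = q.reshape(-1, 2)
    cos, sin = np.cos(angles)[:, None], np.sin(angles)[:, None]
    rot = np.concatenate([q[:, :1] * cos - q[:, 1:] * sin,
                          q[:, :1] * sin + q[:, 1:] * cos], axis=1)
    return rot.reshape(-1)

# Example: embed a visual token at frame t=12, patch position (x=3, y=7).
q = np.random.randn(64)
q_rot = apply_rope(q, multimodal_rope_angles(t=12, x=3, y=7,
                                             head_dim=64, temporal_scale=0.5))
```

Under this kind of scheme, the choice of which bands encode time versus space, and how the temporal index is scaled at inference, is exactly the design space the paper analyzes; the sketch only fixes one arbitrary point in that space.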