HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Abstract

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly on long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies rely mainly on heuristics and lack in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments on four video benchmarks covering long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.
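
To make "allocating different frequencies within RoPE to encode 3D positional information" concrete, the following is a minimal sketch of the general idea only. It is not HoPE's hybrid strategy, which the abstract does not specify. The function names, the interleaved t/x/y pattern, and the toy head dimension are all assumptions chosen for illustration. Each rotary frequency is assigned to one axis (time, width, or height), and the attention score then depends only on the relative offset along each axis.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE frequencies theta_i = base^(-2i/d), ordered high to low.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def allocate_axes(num_freqs, pattern=("t", "x", "y")):
    # Hypothetical allocation (an assumption, not HoPE's): cycle through the
    # axes so each one receives a mix of high and low frequencies.
    # A different allocation strategy would change only this function.
    return [pattern[i % len(pattern)] for i in range(num_freqs)]

def rotate(vec, position, freqs, axes):
    # position is a dict such as {"t": 3, "x": 1, "y": 2}.
    # Each consecutive pair of dimensions is rotated by (index on the
    # assigned axis) * (frequency assigned to that pair).
    out = []
    for i, (theta, axis) in enumerate(zip(freqs, axes)):
        angle = position[axis] * theta
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * math.cos(angle) - b * math.sin(angle))
        out.append(a * math.sin(angle) + b * math.cos(angle))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

head_dim = 12  # toy size; real attention heads are larger
freqs = rope_frequencies(head_dim)
axes = allocate_axes(len(freqs))

# Arbitrary toy query and key vectors.
q = [0.1 * (i + 1) for i in range(head_dim)]
k = [0.05 * (head_dim - i) for i in range(head_dim)]

# Two query/key pairs with the same relative offset (dt=3, dx=2, dy=0)
# but very different absolute positions.
score_a = dot(rotate(q, {"t": 5, "x": 2, "y": 1}, freqs, axes),
              rotate(k, {"t": 2, "x": 0, "y": 1}, freqs, axes))
score_b = dot(rotate(q, {"t": 105, "x": 12, "y": 8}, freqs, axes),
              rotate(k, {"t": 102, "x": 10, "y": 8}, freqs, axes))

# The scores agree up to floating-point error.
print(abs(score_a - score_b) < 1e-9)
```

In this sketch, which frequencies the temporal axis receives is decided entirely inside allocate_axes. That decision is the design choice the abstract says existing methods make heuristically.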
