HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Abstract

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly on long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies rely mainly on heuristics and lack in-depth theoretical analysis. In this paper, we first study how different allocation strategies affect the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.

