HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

Abstract

Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly with long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies rely mainly on heuristics and lack in-depth theoretical analysis. In this paper, we first study how different allocation strategies affect the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.

