Long Context Transfer from Language to Vision
June 24, 2024
Authors: Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu
cs.AI
Abstract
Video sequences offer valuable temporal information, but existing large
multimodal models (LMMs) fall short in understanding extremely long videos.
Many works address this by reducing the number of visual tokens using visual
resamplers. Alternatively, in this paper, we approach this problem from the
perspective of the language model. By simply extrapolating the context length
of the language backbone, we enable LMMs to comprehend orders of magnitude more
visual tokens without any video training. We call this phenomenon long context
transfer and carefully ablate its properties. To effectively measure LMMs'
ability to generalize to long contexts in the vision modality, we develop
V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark
inspired by the language model's NIAH test. Our proposed Long Video Assistant
(LongVA) can process 2000 frames or over 200K visual tokens without additional
complexities. With its extended context length, LongVA achieves
state-of-the-art performance on Video-MME among 7B-scale models by densely
sampling more input frames. Our work is open-sourced at
https://github.com/EvolvingLMMs-Lab/LongVA.
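To make the scale concrete, the sketch below works through the token arithmetic implied by the abstract: each sampled frame contributes a fixed number of visual tokens, so the number of frames an LMM can attend to is bounded by the language backbone's context window. The per-frame token count, context sizes, and text budget here are illustrative assumptions, not LongVA's exact configuration.

```python
# A minimal sketch of the token arithmetic behind long context transfer.
# All constants below are illustrative assumptions, not LongVA's exact settings.

TOKENS_PER_FRAME = 100      # assumed visual tokens emitted per sampled frame
BASE_CONTEXT = 8_192        # typical context window of an unmodified 7B language backbone
EXTENDED_CONTEXT = 224_000  # assumed window after extrapolating the backbone's context length
TEXT_BUDGET = 2_000         # tokens reserved for the question and answer

def max_frames(context_len: int) -> int:
    """Frames that fit after reserving room for text tokens."""
    return max(context_len - TEXT_BUDGET, 0) // TOKENS_PER_FRAME

print(f"base backbone:     ~{max_frames(BASE_CONTEXT)} frames")
print(f"extended backbone: ~{max_frames(EXTENDED_CONTEXT)} frames "
      f"(~{max_frames(EXTENDED_CONTEXT) * TOKENS_PER_FRAME:,} visual tokens)")
```

Under these assumed numbers, an unmodified backbone fits only a few dozen frames, while an extended-context backbone fits on the order of 2000 frames and 200K+ visual tokens, matching the scale reported in the abstract.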