Long Context Transfer from Language to Vision
June 24, 2024
Authors: Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu
cs.AI
Abstract
Video sequences offer valuable temporal information, but existing large
multimodal models (LMMs) fall short in understanding extremely long videos.
Many works address this by reducing the number of visual tokens using visual
resamplers. Alternatively, in this paper, we approach this problem from the
perspective of the language model. By simply extrapolating the context length
of the language backbone, we enable LMMs to comprehend orders of magnitude more
visual tokens without any video training. We call this phenomenon long context
transfer and carefully ablate its properties. To effectively measure LMMs'
ability to generalize to long contexts in the vision modality, we develop
V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark
inspired by the language model's NIAH test. Our proposed Long Video Assistant
(LongVA) can process 2000 frames or over 200K visual tokens without additional
complexities. With its extended context length, LongVA achieves
state-of-the-art performance on Video-MME among 7B-scale models by densely
sampling more input frames. Our work is open-sourced at
https://github.com/EvolvingLMMs-Lab/LongVA.
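The abstract does not spell out how the backbone's context length translates into a frame budget, so the following is a minimal back-of-the-envelope sketch in Python. The per-frame token count and the reserved text budget are illustrative assumptions, not LongVA's actual configuration.

```python
# Illustrative sketch (assumed numbers, not the paper's exact settings):
# how an extended language-backbone context turns into a larger frame budget
# when every sampled frame contributes a fixed number of visual tokens.

TOKENS_PER_FRAME = 144   # assumed visual tokens per frame; the real value
                         # depends on the vision encoder and any pooling
TEXT_BUDGET = 1_000      # assumed tokens reserved for the prompt and answer


def max_frames(context_length: int,
               tokens_per_frame: int = TOKENS_PER_FRAME,
               text_budget: int = TEXT_BUDGET) -> int:
    """Number of frames whose visual tokens fit after reserving text tokens."""
    return max(0, (context_length - text_budget) // tokens_per_frame)


if __name__ == "__main__":
    # Under these assumed counts, a context on the order of 300K tokens is
    # needed to hold roughly 2000 frames ("over 200K visual tokens").
    for ctx in (4_096, 32_768, 128_000, 300_000):
        print(f"context {ctx:>7,} tokens -> up to {max_frames(ctx):,} frames")
```

The arithmetic is why the paper pulls the language-model lever (extrapolating the backbone's context) rather than shrinking the per-frame token count with a visual resampler: once the context grows by an order of magnitude, densely sampling thousands of frames fits without changing the vision side.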