

Memory Consolidation Enables Long-Context Video Understanding

February 8, 2024
Authors: Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff
cs.AI

Abstract

Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity. While various attempts have been made to extend this context, this has often come at the cost of both conceptual and computational complexity. We propose to instead re-purpose existing pre-trained video transformers by simply fine-tuning them to attend to memories derived non-parametrically from past activations. By leveraging redundancy reduction, our memory-consolidated vision transformer (MC-ViT) effortlessly extends its context far into the past and exhibits excellent scaling behavior when learning from longer videos. In doing so, MC-ViT sets a new state-of-the-art in long-context video understanding on EgoSchema, Perception Test, and Diving48, outperforming methods that benefit from orders of magnitude more parameters.
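To make the mechanism described in the abstract concrete, here is a minimal, illustrative sketch (not the authors' implementation): a video is processed chunk by chunk, each chunk's attention additionally reads a small bank of memory tokens, and every processed chunk is consolidated into that memory non-parametrically, here via k-means clustering of its activations as one example of redundancy reduction (the paper studies several such consolidation schemes). All names and sizes below (`StreamingVideoEncoder`, `memory_per_chunk`, etc.) are hypothetical.

```python
# Illustrative sketch of memory-consolidated attention for long videos.
# Assumptions: k-means consolidation, PyTorch, toy dimensions.
from typing import Optional

import torch
import torch.nn as nn


def kmeans_consolidate(tokens: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Reduce (N, D) past activations to k centroids with plain k-means (redundancy reduction)."""
    n = tokens.size(0)
    k = min(k, n)
    centroids = tokens[torch.randperm(n)[:k]].clone()
    for _ in range(iters):
        # Assign each token to its nearest centroid, then recompute the centroids.
        assign = torch.cdist(tokens, centroids).argmin(dim=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                centroids[j] = tokens[mask].mean(dim=0)
    return centroids


class MemoryAugmentedBlock(nn.Module):
    """Standard pre-norm ViT block whose attention also sees memory tokens as extra keys/values."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, memory: Optional[torch.Tensor]) -> torch.Tensor:
        q = self.norm1(x)
        if memory is None:
            kv = q
        else:
            # Queries come only from the current chunk; keys/values include consolidated memory.
            kv = torch.cat([memory.unsqueeze(0).expand(x.size(0), -1, -1), q], dim=1)
        x = x + self.attn(q, kv, kv, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class StreamingVideoEncoder(nn.Module):
    """Processes chunks sequentially, consolidating each chunk's activations into a growing memory."""

    def __init__(self, dim: int = 256, depth: int = 4, memory_per_chunk: int = 32):
        super().__init__()
        self.blocks = nn.ModuleList(MemoryAugmentedBlock(dim) for _ in range(depth))
        self.memory_per_chunk = memory_per_chunk

    def forward(self, chunks: list) -> torch.Tensor:
        memory = None  # grows by a fixed number of tokens per chunk, not per frame
        x = chunks[0]
        for x in chunks:  # each chunk: (1, tokens, dim), e.g. patch embeddings of a few seconds
            for blk in self.blocks:
                x = blk(x, memory)
            new_mem = kmeans_consolidate(x.squeeze(0).detach(), self.memory_per_chunk)
            memory = new_mem if memory is None else torch.cat([memory, new_mem], dim=0)
        return x.mean(dim=1)  # pooled representation of the last chunk, informed by all past memory


if __name__ == "__main__":
    dim = 256
    clips = [torch.randn(1, 196, dim) for _ in range(5)]  # five short clips of 196 tokens each
    print(StreamingVideoEncoder(dim)(clips).shape)  # torch.Size([1, 256])
```

The point of this design, as the abstract indicates, is that the memory bank adds only a fixed number of consolidated tokens per chunk, so the cost of attending to the past grows roughly linearly with video length instead of quadratically with the full token count.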