

Memory Consolidation Enables Long-Context Video Understanding

February 8, 2024
Authors: Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, Olivier J. Hénaff
cs.AI

Abstract

Most transformer-based video encoders are limited to short temporal contexts due to their quadratic complexity. While various attempts have been made to extend this context, this has often come at the cost of both conceptual and computational complexity. We propose to instead re-purpose existing pre-trained video transformers by simply fine-tuning them to attend to memories derived non-parametrically from past activations. By leveraging redundancy reduction, our memory-consolidated vision transformer (MC-ViT) effortlessly extends its context far into the past and exhibits excellent scaling behavior when learning from longer videos. In doing so, MC-ViT sets a new state-of-the-art in long-context video understanding on EgoSchema, Perception Test, and Diving48, outperforming methods that benefit from orders of magnitude more parameters.
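To make the idea of attending to non-parametrically consolidated memories concrete, the sketch below compresses past activations into a small set of centroids (a k-means-style redundancy reduction, one of several possible non-parametric choices) and lets the current tokens attend over both the current activations and that memory. This is a minimal illustration under stated assumptions, not the authors' implementation; all function names, shapes, and the clustering choice are hypothetical.

```python
# Illustrative sketch: consolidate past activations into a compact memory,
# then attend to current tokens plus that memory. Names and the k-means
# consolidation are assumptions, not the MC-ViT implementation.
import numpy as np

def consolidate_memory(past_tokens, num_memories, iters=10, seed=0):
    """Reduce redundancy in past activations by clustering them into
    `num_memories` centroids (one non-parametric consolidation choice)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(past_tokens), num_memories, replace=False)
    centroids = past_tokens[idx].copy()
    for _ in range(iters):
        # Assign each past token to its nearest centroid.
        dists = ((past_tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned tokens.
        for k in range(num_memories):
            members = past_tokens[assign == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids

def attend_with_memory(queries, current_tokens, memory):
    """Scaled dot-product attention where keys/values are the concatenation
    of current tokens and the consolidated memory."""
    kv = np.concatenate([current_tokens, memory], axis=0)
    scores = queries @ kv.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

# Toy usage: 512 past tokens compressed into 32 memories, then attended to.
dim = 64
past = np.random.randn(512, dim).astype(np.float32)
current = np.random.randn(128, dim).astype(np.float32)
memory = consolidate_memory(past, num_memories=32)
out = attend_with_memory(current, current, memory)
print(out.shape)  # (128, 64)
```

Because the memory holds only a few dozen consolidated vectors regardless of how far back the video extends, the attention cost stays close to that of a short-context encoder while the effective temporal context grows.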