Multiscale Video Pretraining for Long-Term Activity Forecasting
July 24, 2023
Authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani
cs.AI
Abstract
Long-term activity forecasting is an especially challenging research problem
because it requires understanding the temporal relationships between observed
actions, as well as the variability and complexity of human activities. Despite
relying on strong supervision via expensive human annotations, state-of-the-art
forecasting approaches often generalize poorly to unseen data. To alleviate
this issue, we propose Multiscale Video Pretraining (MVP), a novel
self-supervised pretraining approach that learns robust representations for
forecasting by learning to predict contextualized representations of future
video clips over multiple timescales. MVP is based on our observation that
actions in videos have a multiscale nature, where atomic actions typically
occur at a short timescale and more complex actions may span longer timescales.
We compare MVP to state-of-the-art self-supervised video learning approaches on
downstream long-term forecasting tasks including long-term action anticipation
and video summary prediction. Our comprehensive experiments across the Ego4D
and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms
state-of-the-art methods by significant margins. Notably, MVP obtains a
relative gain of over 20% in video summary forecasting accuracy over
existing methods.
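
To make the pretraining objective concrete, here is a minimal, hypothetical sketch of a multiscale future-prediction loss in the spirit of MVP. It is not the authors' released code: the function name multiscale_prediction_loss, the per-scale linear prediction heads, the mean-pooling aggregator standing in for the paper's contextualized target representations, and the cosine-similarity objective are all illustrative assumptions.

```python
# Illustrative sketch only, assuming per-clip embeddings from some video encoder.
import torch
import torch.nn.functional as F

def multiscale_prediction_loss(clip_emb, predictor_heads, scales):
    """clip_emb: (B, T, D) embeddings for T consecutive clips per video.
    predictor_heads: one module per scale, mapping (B, D) -> (B, D).
    scales: number of future clips aggregated into each target, so short
    scales roughly cover atomic actions and long scales more complex ones."""
    observed = clip_emb[:, 0, :]  # representation of the observed context
    loss = 0.0
    for head, s in zip(predictor_heads, scales):
        # Target: aggregated representation of the next s clips.
        # Mean pooling is a stand-in for the paper's contextualization step.
        target = clip_emb[:, 1:1 + s, :].mean(dim=1).detach()
        pred = head(observed)  # scale-specific prediction head
        # Negative cosine similarity pulls prediction and target together.
        loss = loss - F.cosine_similarity(pred, target, dim=-1).mean()
    return loss / len(scales)

# Usage example with random embeddings: 4 videos, 9 clips, 256-dim features.
heads = torch.nn.ModuleList(torch.nn.Linear(256, 256) for _ in range(3))
emb = torch.randn(4, 9, 256)
loss = multiscale_prediction_loss(emb, heads, scales=[1, 4, 8])
loss.backward()
```

The key design point this sketch captures is that one shared observed representation is trained against targets at several horizons at once, so the pretrained features must encode both short-term and long-term structure; the actual aggregator and loss in MVP may differ.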