Multiscale Video Pretraining for Long-Term Activity Forecasting

Abstract

Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by predicting contextualized representations of future video clips over multiple timescales. MVP is based on our observation that actions in videos have a multiscale nature: atomic actions typically occur at a short timescale, while more complex actions may span longer timescales. We compare MVP to state-of-the-art self-supervised video learning approaches on downstream long-term forecasting tasks, including long-term action anticipation and video summary prediction. Our comprehensive experiments across the Ego4D and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms state-of-the-art methods by significant margins. Notably, MVP obtains a relative accuracy gain of over 20% in video summary forecasting over existing methods.
