Multiscale Video Pretraining for Long-Term Activity Forecasting
July 24, 2023
Authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani
cs.AI
Abstract
Long-term activity forecasting is an especially challenging research problem
because it requires understanding the temporal relationships between observed
actions, as well as the variability and complexity of human activities. Despite
relying on strong supervision via expensive human annotations, state-of-the-art
forecasting approaches often generalize poorly to unseen data. To alleviate
this issue, we propose Multiscale Video Pretraining (MVP), a novel
self-supervised pretraining approach that learns robust representations for
forecasting by predicting contextualized representations of future video
clips over multiple timescales. MVP is based on our observation that
actions in videos have a multiscale nature, where atomic actions typically
occur at a short timescale and more complex actions may span longer timescales.
We compare MVP to state-of-the-art self-supervised video learning approaches on
downstream long-term forecasting tasks including long-term action anticipation
and video summary prediction. Our comprehensive experiments across the Ego4D
and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms
state-of-the-art methods by significant margins. Notably, MVP obtains a
relative gain of over 20% in video summary forecasting accuracy compared to
existing methods.
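
The abstract describes the MVP objective only at a high level: predict contextualized representations of future video clips at several timescales from the observed clips. As a rough illustration of that idea, the sketch below sets up one possible multiscale future-prediction loss in PyTorch. The mean pooling, the per-timescale MLP heads, and the cosine-similarity loss are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleFuturePrediction(nn.Module):
    """Toy multiscale future-prediction objective (illustrative only).

    From a summary of the observed clips, one head per timescale predicts
    a pooled representation of the next k future clips. This mirrors the
    abstract's idea that short windows capture atomic actions while longer
    windows capture more complex ones; the pooling, MLP heads, and loss
    below are assumptions, not the paper's implementation.
    """

    def __init__(self, dim: int = 512, timescales: tuple = (1, 2, 4)):
        super().__init__()
        self.timescales = timescales
        # One lightweight prediction head per timescale (assumed MLPs).
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in timescales
        )

    def forward(self, clip_emb: torch.Tensor, n_observed: int) -> torch.Tensor:
        # clip_emb: (batch, n_clips, dim) clip embeddings from any video
        # encoder; the first n_observed clips are the observed context.
        context = clip_emb[:, :n_observed].mean(dim=1)
        future = clip_emb[:, n_observed:]
        loss = clip_emb.new_zeros(())
        for head, k in zip(self.heads, self.timescales):
            if future.size(1) < k:
                continue  # not enough future clips at this timescale
            # Target: future representation pooled over a window of k
            # clips; detached so only the prediction path gets gradients.
            target = future[:, :k].mean(dim=1).detach()
            pred = head(context)
            # Negative cosine similarity, a common self-supervised loss.
            loss = loss + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
        return loss / len(self.timescales)
```

For example, with `clip_emb` of shape `(8, 12, 512)` and `n_observed=6`, the heads predict pooled targets from the 1, 2, and 4 clips that follow the observed window. In the paper's setting the targets would presumably come from a contextualized target encoder rather than simple mean pooling.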