Multiscale Video Pretraining for Long-Term Activity Forecasting

Abstract

Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision from expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issue, we propose Multiscale Video Pretraining (MVP), a novel self-supervised pretraining approach that learns robust representations for forecasting by learning to predict contextualized representations of future video clips over multiple timescales. MVP is based on our observation that actions in videos have a multiscale nature: atomic actions typically occur at a short timescale, while more complex actions may span longer timescales. We compare MVP to state-of-the-art self-supervised video learning approaches on downstream long-term forecasting tasks, including long-term action anticipation and video summary prediction. Our comprehensive experiments across the Ego4D and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms state-of-the-art methods by significant margins. Notably, MVP obtains a relative accuracy gain of over 20% in video summary forecasting over existing methods.
