Learning to Ground Instructional Articles in Videos through Narrations

Abstract

In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) {\em direct} alignment of step descriptions to frames, ii) {\em indirect} alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels which are iteratively refined and aggressively filtered. To validate our model we introduce a new evaluation benchmark, HT-Step, obtained by manually annotating a 124-hour subset of HowTo100M with steps sourced from wikiHow articles. A test server is accessible at \url{https://eval.ai/web/challenges/challenge-page/2082}. Experiments on this benchmark, as well as zero-shot evaluations on CrossTask, demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner module for narration-to-video matching outperforms the state of the art by a large margin on the HTM-Align narration-video alignment benchmark.
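The two-pathway idea in the abstract (a direct step-to-frame score fused with an indirect score obtained by composing steps-to-narrations with narrations-to-video) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not specify the composition or fusion operators, so this sketch uses a softmax-weighted sum for composition and a simple weighted average (the hypothetical `alpha` parameter) for fusion. The function names and toy scores are invented, and the order-aware global grounding mentioned in the abstract is deliberately left out.

```python
import math


def softmax(row):
    """Turn a row of raw scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def compose(step_to_narr, narr_to_frame):
    """Indirect step-to-frame scores (assumed operator, not from the paper):
    weight each narration's frame scores by how well the step matches it."""
    n_frames = len(narr_to_frame[0])
    indirect = []
    for row in step_to_narr:
        weights = softmax(row)
        indirect.append([
            sum(w * narr_to_frame[k][t] for k, w in enumerate(weights))
            for t in range(n_frames)
        ])
    return indirect


def fuse(direct, indirect, alpha=0.5):
    """Weighted average of the two pathways; alpha is a hypothetical knob."""
    return [
        [alpha * d + (1 - alpha) * i for d, i in zip(d_row, i_row)]
        for d_row, i_row in zip(direct, indirect)
    ]


def ground(scores):
    """Pick the best-scoring frame index for each step independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy scores: 2 steps, 2 narrations, 4 frames (all values invented).
direct = [[0.1, 0.2, 0.1, 0.0],
          [0.0, 0.1, 0.2, 0.3]]
step_to_narr = [[5.0, 0.0],
                [0.0, 5.0]]
narr_to_frame = [[0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 0.1, 0.9]]

fused = fuse(direct, compose(step_to_narr, narr_to_frame))
grounded = ground(fused)
print(grounded)  # [0, 3]
```

In this toy case the direct pathway alone would place the first step at frame 1, while the narration pathway pulls it to frame 0. This shows only how a second pathway can change the grounding, not that it does so in the way the paper's model does.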
