Learning to Ground Instructional Articles in Videos through Narrations
June 6, 2023
Authors: Effrosyni Mavroudi, Triantafyllos Afouras, Lorenzo Torresani
cs.AI
Abstract
In this paper we present an approach for localizing steps of procedural
activities in narrated how-to videos. To deal with the scarcity of labeled data
at scale, we source the step descriptions from a language knowledge base
(wikiHow) containing instructional articles for a large variety of procedural
tasks. Without any form of manual supervision, our model learns to temporally
ground the steps of procedural articles in how-to videos by matching three
modalities: frames, narrations, and step descriptions. Specifically, our method
aligns steps to video by fusing information from two distinct pathways: i) {\em
direct} alignment of step descriptions to frames, ii) {\em indirect} alignment
obtained by composing steps-to-narrations with narrations-to-video
correspondences. Notably, our approach performs global temporal grounding of
all steps in an article at once by exploiting order information, and is trained
with step pseudo-labels which are iteratively refined and aggressively
filtered. In order to validate our model we introduce a new evaluation
benchmark -- HT-Step -- obtained by manually annotating a 124-hour subset of
HowTo100M with steps sourced from wikiHow articles. (A test server is
accessible at \url{https://eval.ai/web/challenges/challenge-page/2082}.)
Experiments on this benchmark as well as zero-shot
evaluations on CrossTask demonstrate that our multi-modality alignment yields
dramatic gains over several baselines and prior works. Finally, we show that
our inner module for matching narrations to video outperforms by a large margin
the state of the art on the HTM-Align narration-video alignment benchmark.
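To make the two-pathway alignment concrete, below is a minimal, hypothetical sketch (not the authors' released code) of how a direct steps-to-frames similarity can be fused with an indirect one obtained by composing steps-to-narrations and narrations-to-frames similarities. The toy shapes, the random similarity matrices, the softmax-based composition, and the simple averaging fusion are all illustrative assumptions.

```python
# Minimal sketch of fusing a direct and an indirect alignment pathway.
# All names, shapes, and the averaging fusion are illustrative assumptions,
# not the paper's actual model.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
S, N, T = 4, 6, 10  # number of steps, narrations, and video frames (toy sizes)

# Random stand-ins for the pairwise similarities a tri-modal encoder might produce.
step_frame = rng.normal(size=(S, T))  # step descriptions vs. frames
step_narr  = rng.normal(size=(S, N))  # step descriptions vs. narrations
narr_frame = rng.normal(size=(N, T))  # narrations vs. frames

# Direct pathway: align step descriptions straight to frames.
direct = softmax(step_frame, axis=-1)  # (S, T)

# Indirect pathway: compose steps-to-narrations with narrations-to-frames.
indirect = softmax(step_narr, axis=-1) @ softmax(narr_frame, axis=-1)  # (S, T)

# Fuse the two pathways (simple average here, purely for illustration).
fused = 0.5 * (direct + indirect)  # (S, T)

# A naive "grounding": pick the most likely frame for each step.
print("Predicted frame per step:", fused.argmax(axis=-1))
```

The sketch only shows the basic matrix-composition idea; the method described in the abstract additionally grounds all steps of an article at once using order information and is trained with iteratively refined, aggressively filtered step pseudo-labels.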