Learning to Ground Instructional Articles in Videos through Narrations
June 6, 2023
作者: Effrosyni Mavroudi, Triantafyllos Afouras, Lorenzo Torresani
cs.AI
Abstract
In this paper we present an approach for localizing steps of procedural
activities in narrated how-to videos. To deal with the scarcity of labeled data
at scale, we source the step descriptions from a language knowledge base
(wikiHow) containing instructional articles for a large variety of procedural
tasks. Without any form of manual supervision, our model learns to temporally
ground the steps of procedural articles in how-to videos by matching three
modalities: frames, narrations, and step descriptions. Specifically, our method
aligns steps to video by fusing information from two distinct pathways: i) {\em
direct} alignment of step descriptions to frames, ii) {\em indirect} alignment
obtained by composing steps-to-narrations with narrations-to-video
correspondences. Notably, our approach performs global temporal grounding of
all steps in an article at once by exploiting order information, and is trained
with step pseudo-labels which are iteratively refined and aggressively
filtered. In order to validate our model we introduce a new evaluation
benchmark -- HT-Step -- obtained by manually annotating a 124-hour subset of
HowTo100M with steps sourced from wikiHow articles (a test server is
accessible at \url{https://eval.ai/web/challenges/challenge-page/2082}).
Experiments on this benchmark as well as zero-shot
evaluations on CrossTask demonstrate that our multi-modality alignment yields
dramatic gains over several baselines and prior works. Finally, we show that
our inner module for matching narrations to video outperforms the state of the
art by a large margin on the HTM-Align narration-video alignment benchmark.
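To make the two-pathway alignment concrete, here is a minimal illustrative sketch (not the authors' code) of how a direct step-to-frame pathway can be fused with an indirect pathway composed of steps-to-narrations and narrations-to-video correspondences. It assumes precomputed cosine-similarity matrices from three encoders; the function name, the softmax routing, and the fusion weight `alpha` are hypothetical stand-ins for the model's learned fusion.

```python
# Hypothetical sketch: fusing direct and indirect step-to-frame alignment
# pathways from precomputed similarity matrices.
import torch
import torch.nn.functional as F

def fuse_alignment_pathways(step_frame_sim, step_narr_sim, narr_frame_sim, alpha=0.5):
    """Combine two step-to-frame alignment pathways.

    step_frame_sim: (S, T) direct similarities between S step descriptions and T frames
    step_narr_sim:  (S, N) similarities between steps and N narration sentences
    narr_frame_sim: (N, T) similarities between narrations and frames
    alpha:          illustrative fusion weight (assumed hyper-parameter)
    """
    # Direct pathway: step descriptions matched to frames.
    direct = step_frame_sim

    # Indirect pathway: route each step through its soft narration matches,
    # then through narration-to-frame correspondences.
    step_to_narr = F.softmax(step_narr_sim, dim=-1)   # (S, N)
    indirect = step_to_narr @ narr_frame_sim          # (S, T)

    # A simple convex combination stands in for the model's learned fusion.
    return alpha * direct + (1.0 - alpha) * indirect

# Example usage with random features standing in for real encoders.
S, N, T, d = 8, 20, 100, 256
steps  = F.normalize(torch.randn(S, d), dim=-1)
narrs  = F.normalize(torch.randn(N, d), dim=-1)
frames = F.normalize(torch.randn(T, d), dim=-1)

fused = fuse_alignment_pathways(steps @ frames.T, steps @ narrs.T, narrs @ frames.T)
pred_frames = fused.argmax(dim=-1)   # per-step index of the best-matching frame
print(pred_frames.shape)             # torch.Size([8])
```

In this sketch the per-step argmax grounds each step independently; the paper instead performs global temporal grounding of all steps in an article at once, exploiting their order, and trains with iteratively refined, aggressively filtered pseudo-labels.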