MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
June 14, 2025
Authors: Yuan Zang, Hao Tan, Seunghyun Yoon, Franck Dernoncourt, Jiuxiang Gu, Kushal Kafle, Chen Sun, Trung Bui
cs.AI
Abstract
We study multi-modal summarization for instructional videos, whose goal is to
provide users with an efficient way to learn skills in the form of text instructions
and key video frames. We observe that existing benchmarks focus on generic
semantic-level video summarization, and are not suitable for providing
step-by-step executable instructions and illustrations, both of which are
crucial for instructional videos. We propose a novel benchmark for user
interface (UI) instructional video summarization to fill this gap. We collect a
dataset of 2,413 UI instructional videos spanning over 167 hours. These
videos are manually annotated for video segmentation, text summarization, and
video summarization, which enables comprehensive evaluation of concise and
executable video summarization. We conduct extensive experiments on our
collected MS4UI dataset; the results suggest that state-of-the-art multi-modal
summarization methods struggle with UI video summarization and highlight the
importance of new methods for UI instructional video summarization.