MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
June 14, 2025
Authors: Yuan Zang, Hao Tan, Seunghyun Yoon, Franck Dernoncourt, Jiuxiang Gu, Kushal Kafle, Chen Sun, Trung Bui
cs.AI
Abstract
We study multi-modal summarization for instructional videos, whose goal is to
give users an efficient way to learn skills through text instructions and key
video frames. We observe that existing benchmarks focus on generic
semantic-level video summarization and are not suited to providing
step-by-step executable instructions and illustrations, both of which are
crucial for instructional videos. To fill this gap, we propose a novel benchmark
for user interface (UI) instructional video summarization. We collect a
dataset of 2,413 UI instructional videos spanning more than 167 hours. These
videos are manually annotated for video segmentation, text summarization, and
video summarization, which enables comprehensive evaluation of concise and
executable video summarization. Extensive experiments on our MS4UI dataset
suggest that state-of-the-art multi-modal summarization methods struggle with
UI video summarization, highlighting the need for new methods for UI
instructional video summarization.