MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos

Abstract

We study multi-modal summarization for instructional videos, whose goal is to give users an efficient way to learn skills through text instructions and key video frames. We observe that existing benchmarks focus on generic, semantic-level video summarization and are not suited to providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. To fill this gap, we propose a novel benchmark for user interface (UI) instructional video summarization. We collect a dataset of 2,413 UI instructional videos spanning over 167 hours. These videos are manually annotated for video segmentation, text summarization, and video summarization, which enables comprehensive evaluation of concise and executable video summarization. We conduct extensive experiments on the collected MS4UI dataset. The results suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization and highlight the need for new methods designed for UI instructional video summarization.