MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos

Abstract

We study multi-modal summarization for instructional videos, whose goal is to give users an efficient way to learn skills through text instructions and key video frames. We observe that existing benchmarks focus on generic, semantic-level video summarization and are not suited to providing step-by-step executable instructions and illustrations, both of which are crucial for instructional videos. To fill this gap, we propose a novel benchmark for user interface (UI) instructional video summarization. We collect a dataset of 2,413 UI instructional videos spanning more than 167 hours. The videos are manually annotated for video segmentation, text summarization, and video summarization, which enables comprehensive evaluation of concise and executable video summaries. Extensive experiments on the collected MS4UI dataset suggest that state-of-the-art multi-modal summarization methods struggle on UI video summarization, highlighting the need for new methods tailored to UI instructional videos.