VideoGLUE: Video General Understanding Evaluation of Foundation Models

Abstract

We evaluate the video understanding capabilities of existing foundation models using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring a foundation model (FM) for a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding videos that contain more than one action. Third, video-native FMs can perform well on video tasks under light adaptations to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win in full end-to-end finetuning. The first two observations reveal the need for, and the tremendous opportunities in, research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when evaluating FMs.
