VideoGLUE: Video General Understanding Evaluation of Foundation Models

Abstract

We evaluate the video understanding capabilities of existing foundation models using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods that tailor a foundation model (FM) to a downstream task. Moreover, we propose a scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when adapting to general video understanding tasks. Our main findings are as follows. First, task-specialized models significantly outperform the six FMs studied in this work, in sharp contrast to what FMs have achieved in natural language and image understanding. Second, video-native FMs, whose pretraining data contains the video modality, are generally better than image-native FMs at classifying motion-rich videos, localizing actions in time, and understanding videos that contain more than one action. Third, video-native FMs can perform well on video tasks under light adaptation to downstream tasks (e.g., freezing the FM backbones), while image-native FMs win under full end-to-end finetuning. The first two observations reveal the need for, and the tremendous opportunities in, research on video-focused FMs, and the last confirms that both tasks and adaptation methods matter when evaluating FMs.
