
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

August 21, 2024
Authors: Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee
cs.AI

Abstract

In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, our model, pretrained only on publicly accessible datasets, shows a 4.6%p improvement over V-JEPA (ViT-L) and a 7.7%p improvement over UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement over DFN (ViT-H), a 2.7%p improvement over V-JEPA (ViT-H), and a 2.8%p improvement over InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at https://github.com/twelvelabs-io/video-embeddings-evaluation-framework.
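To illustrate the linear-probing protocol referenced in the abstract, the sketch below trains a linear classifier on precomputed clip embeddings and reports top-1 accuracy. The file names (train_embeddings.npy, etc.) and the use of scikit-learn are assumptions made here for illustration; they are not the interface of the released evaluation code in the repository linked above, which should be consulted for the actual benchmark setup.

```python
# Minimal sketch of linear probing on precomputed video embeddings.
# File names and label format are hypothetical placeholders, not the
# official evaluation interface of the TWLV-I repository.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# One D-dimensional embedding vector per video clip (frozen backbone output).
train_emb = np.load("train_embeddings.npy")   # shape (N_train, D)
train_lbl = np.load("train_labels.npy")       # shape (N_train,)
test_emb = np.load("test_embeddings.npy")     # shape (N_test, D)
test_lbl = np.load("test_labels.npy")         # shape (N_test,)

# Linear probing: the backbone stays frozen; only a linear classifier
# on top of the embeddings is trained.
clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(train_emb, train_lbl)

# Top-1 accuracy on the action recognition test split.
top1 = accuracy_score(test_lbl, clf.predict(test_emb))
print(f"Linear-probe top-1 accuracy: {top1:.4f}")
```

The same procedure, repeated per benchmark and averaged, corresponds to the "average top-1 accuracy of linear probing on five action recognition benchmarks" figure reported above.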
