

TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

August 21, 2024
Authors: Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee
cs.AI

Abstract

In this work, we discuss how to evaluate video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT and InternVideo2 or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, our model, pretrained only on publicly accessible datasets, shows a 4.6%p improvement over V-JEPA (ViT-L) and a 7.7%p improvement over UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement over DFN (ViT-H), a 2.7%p improvement over V-JEPA (ViT-H), and a 2.8%p improvement over InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at https://github.com/twelvelabs-io/video-embeddings-evaluation-framework.
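
The linear-probing numbers quoted above can, in principle, be reproduced directly from the released per-video embeddings. Below is a minimal sketch of that protocol: fit a linear classifier on frozen training embeddings and report top-1 accuracy on the validation split. The file name, array keys, and the use of scikit-learn's LogisticRegression are illustrative assumptions, not the repository's actual interface; see the linked GitHub repository for the authors' own evaluation code.

```python
# Minimal linear-probing sketch over precomputed video embeddings.
# The .npz file name and its array keys are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression


def linear_probe_top1(train_emb, train_labels, val_emb, val_labels):
    """Fit a linear classifier on frozen embeddings and return top-1 accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_emb, train_labels)          # embeddings stay frozen; only the probe is trained
    return clf.score(val_emb, val_labels)     # mean top-1 accuracy on the validation split


if __name__ == "__main__":
    # Hypothetical layout: (num_videos, embed_dim) embeddings with integer class labels.
    data = np.load("k400_twlv_i_embeddings.npz")
    acc = linear_probe_top1(
        data["train_embeddings"], data["train_labels"],
        data["val_embeddings"], data["val_labels"],
    )
    print(f"Top-1 linear-probe accuracy: {acc:.4f}")
```

Averaging this accuracy across the five action recognition benchmarks would yield the comparison metric reported in the abstract.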
