TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Abstract

In this work, we discuss how to evaluate video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated under differing settings (such as sampling rate, number of frames, and number of pretraining steps), which makes fair and robust comparison difficult. We therefore present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance understanding and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion-based and appearance-based videos. Pretrained only on publicly accessible datasets, our model improves the average top-1 accuracy of linear probing on five action recognition benchmarks by 4.6%p over V-JEPA (ViT-L) and by 7.7%p over UMT (ViT-L). Even when compared with much larger models, our model shows a 7.2%p improvement over DFN (ViT-H), a 2.7%p improvement over V-JEPA (ViT-H), and a 2.8%p improvement over InternVideo2 (ViT-g). We provide the embedding vectors obtained by TWLV-I from the videos of several commonly used video benchmarks, along with evaluation source code that can use these embeddings directly. The code is available at "https://github.com/twelvelabs-io/video-embeddings-evaluation-framework".
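The headline numbers above are gaps, in percentage points (%p), between average top-1 accuracies. The following is a minimal sketch of that arithmetic, not the evaluation code from the linked repository. The function names, the benchmark keys, the unweighted mean across benchmarks, and every numeric value are assumptions made for illustration; none of them are results from the paper.

```python
from statistics import mean


def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    if len(logits) != len(labels) or not labels:
        raise ValueError("logits and labels must be non-empty and the same length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


def average_top1(per_benchmark_accuracy):
    """Unweighted mean of top-1 accuracy (in percent) across benchmarks.

    Assumption: the average is unweighted; the abstract does not say.
    """
    return mean(per_benchmark_accuracy.values())


def percentage_point_gap(model_a, model_b):
    """Difference between the two averages, in percentage points (%p)."""
    return average_top1(model_a) - average_top1(model_b)


# Toy classifier outputs and labels, for illustration only.
toy_logits = [[0.1, 0.7, 0.2], [0.9, 0.05, 0.05], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
toy_labels = [1, 0, 2, 1]
toy_accuracy = top1_accuracy(toy_logits, toy_labels)

# Hypothetical per-benchmark accuracies in percent; NOT values from the paper.
model_a = {"bench_1": 80.0, "bench_2": 60.0}
model_b = {"bench_1": 75.0, "bench_2": 55.0}
gap = percentage_point_gap(model_a, model_b)
```

A percentage-point gap is an absolute difference between two accuracies expressed in percent, not a relative improvement, which is why the sketch subtracts the averages rather than dividing them.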
