Vidi: Large Multimodal Models for Video Understanding and Editing

Abstract

Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality, large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos that correspond to a given text query, which plays a critical role in intelligent editing. The model can process hour-long videos with strong temporal understanding, e.g., retrieving the time ranges that match a given query. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than in existing temporal retrieval datasets, 2) Audio support: includes audio-based queries, 3) Query format: diverse query lengths/formats, 4) Annotation quality: ground-truth time ranges are manually annotated, 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.
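To make the idea of an IoU over multiple time ranges concrete, the following is a minimal sketch and not the VUE-TR metric itself, whose exact definition is not given in this abstract. It assumes that both the prediction and the ground truth are lists of (start, end) pairs in seconds, merges overlapping ranges within each list, and divides the total overlapped duration by the total duration of the union. The names merge_ranges, intersection_length, and multi_range_iou are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # (start_seconds, end_seconds); an assumed representation


def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges and merge any that overlap or touch; drop empty ones."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if end <= start:
            continue  # skip empty or inverted ranges
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(ranges: List[Range]) -> float:
    """Total duration covered by a list of already-merged ranges."""
    return sum(end - start for start, end in ranges)


def intersection_length(a: List[Range], b: List[Range]) -> float:
    """Total overlap between two merged, sorted range lists (two-pointer sweep)."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def multi_range_iou(pred: List[Range], gt: List[Range]) -> float:
    """IoU between the union of predicted ranges and the union of ground-truth ranges."""
    p, g = merge_ranges(pred), merge_ranges(gt)
    inter = intersection_length(p, g)
    union = total_length(p) + total_length(g) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (20.0, 30.0)]
    ground_truth = [(5.0, 15.0), (20.0, 25.0)]
    print(multi_range_iou(predicted, ground_truth))
```

In this example the two sets overlap for 10 seconds in total and their union covers 25 seconds. Merging before measuring keeps overlapping predictions from being counted twice, which is one plausible reason a single-range IoU needs refinement when a query matches several segments of a long video.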
