UniVBench: Towards Unified Evaluation for Video Foundation Models

Abstract

Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope: each targets a single task, relies on task-specific metrics, and typically uses short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse, multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure that our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.
