

UniVBench: Towards Unified Evaluation for Video Foundation Models

February 25, 2026
Authors: Jianhui Wei, Xiaotian Zhang, Yichen Li, Yuan Wang, Yan Zhang, Ziyi Chen, Zhihang Tang, Wei Xu, Zuozhu Liu
cs.AI

Abstract

Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope: each targets a single task, relies on task-specific metrics, and typically uses short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse, multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.