

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

May 26, 2025
作者: Shenghai Yuan, Xianyi He, Yufan Deng, Yang Ye, Jinfa Huang, Bin Lin, Chongyang Ma, Jiebo Luo, Li Yuan
cs.AI

Abstract

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in video production. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench, which focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. For these purposes, OpenS2V-Eval introduces 180 prompts from seven major S2V categories, incorporating both real and synthetic test data. Furthermore, to accurately align S2V benchmarks with human preferences, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 16 representative S2V models, highlighting their strengths and weaknesses across different types of content. Moreover, we create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triples. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.
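The abstract describes three automatic metrics (NexusScore, NaturalScore, GmeScore) that separately quantify subject consistency, naturalness, and text relevance. As a minimal sketch of how such per-video metric scores might be combined into a single benchmark number, the snippet below uses a weighted mean; the equal weighting, the [0, 1] score range, and the function name are illustrative assumptions, not the paper's actual aggregation protocol.

```python
# Hypothetical sketch: combining the three OpenS2V-Eval metric axes into
# one overall score. Weights, score range, and function name are assumed,
# not taken from the paper.

def aggregate_s2v_score(nexus: float, natural: float, gme: float,
                        weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted mean of subject consistency (NexusScore), naturalness
    (NaturalScore), and text relevance (GmeScore), each assumed in [0, 1]."""
    scores = (nexus, natural, gme)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("each metric score must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))

# Example: a generated video scoring 0.8 / 0.9 / 0.7 on the three axes.
overall = aggregate_s2v_score(0.8, 0.9, 0.7)
print(round(overall, 3))  # 0.8
```

A weighted mean keeps the three axes interpretable in isolation while still yielding a single leaderboard-style number; non-uniform weights could emphasize, say, subject consistency over text relevance.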

