Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
June 23, 2025
Authors: Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
cs.AI
Abstract
Subject-to-video generation has witnessed substantial progress in recent
years. However, existing models still face significant challenges in faithfully
following textual instructions. This limitation, commonly known as the
copy-paste problem, arises from the widely used in-pair training paradigm. This
approach inherently entangles subject identity with background and contextual
attributes by sampling reference images from the same scene as the target
video. To address this issue, we introduce Phantom-Data, the first
general-purpose cross-pair subject-to-video consistency dataset, containing
approximately one million identity-consistent pairs across diverse categories.
Our dataset is constructed via a three-stage pipeline: (1) a general and
input-aligned subject detection module, (2) large-scale cross-context subject
retrieval from more than 53 million videos and 3 billion images, and (3)
prior-guided identity verification to ensure visual consistency under
contextual variation. Comprehensive experiments show that training with
Phantom-Data significantly improves prompt alignment and visual quality while
preserving identity consistency on par with in-pair baselines.
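The three-stage pipeline above can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, embedding representation, and the use of a plain cosine-similarity threshold to stand in for prior-guided identity verification are all illustrative assumptions:

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def build_cross_pairs(target_subjects, candidate_bank, retrieve_k=5, sim_threshold=0.8):
    """Sketch of cross-pair construction (hypothetical interface).

    target_subjects: {subject_id: embedding} detected in target videos (stage 1,
        assumed done upstream by a detection module).
    candidate_bank: {candidate_id: embedding} drawn from other scenes/contexts.
    Stage 2 retrieves the top-k cross-context candidates per subject; stage 3 is
    approximated here by a similarity threshold in place of the paper's
    prior-guided identity verification.
    """
    pairs = []
    for subj_id, subj_emb in target_subjects.items():
        # Stage 2: rank cross-context candidates by embedding similarity.
        scored = sorted(
            ((cosine(subj_emb, emb), cand_id) for cand_id, emb in candidate_bank.items()),
            reverse=True,
        )[:retrieve_k]
        # Stage 3: keep only candidates passing the identity check.
        for score, cand_id in scored:
            if score >= sim_threshold:
                pairs.append((subj_id, cand_id, score))
    return pairs
```

The key design point the sketch mirrors is that reference candidates come from a bank of *different* scenes, so the accepted pairs share identity but not background or context, avoiding the copy-paste entanglement of in-pair sampling.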