

Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset

June 23, 2025
作者: Zhuowei Chen, Bingchuan Li, Tianxiang Ma, Lijie Liu, Mingcong Liu, Yi Zhang, Gen Li, Xinghui Li, Siyu Zhou, Qian He, Xinglong Wu
cs.AI

Abstract

Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the copy-paste problem, arises from the widely used in-pair training paradigm. This approach inherently entangles subject identity with background and contextual attributes by sampling reference images from the same scene as the target video. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.
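
The abstract outlines a three-stage data construction pipeline: input-aligned subject detection, large-scale cross-context retrieval, and prior-guided identity verification. The sketch below illustrates that control flow only; every function, class, and parameter name (detect_subjects, retrieve_cross_context, verify_identity, SubjectPair, build_pairs) is a hypothetical placeholder, since the abstract does not specify a concrete implementation or API.

```python
# Minimal, illustrative sketch of the three-stage pipeline described in the
# abstract (subject detection -> cross-context retrieval -> identity
# verification). All names are hypothetical placeholders; the stubs only
# show the control flow, not the paper's actual models.

from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SubjectPair:
    reference_image: str  # cross-context reference image ID
    target_video: str     # target video clip ID
    subject: str          # shared subject label


def detect_subjects(video: str, prompt: str) -> List[str]:
    """Stage 1 (stub): detect prompt-aligned subjects in the target video."""
    return []  # placeholder: a real detector would return subject labels


def retrieve_cross_context(subject: str, image_corpus: Iterable[str]) -> List[str]:
    """Stage 2 (stub): retrieve same-subject images from other scenes/contexts."""
    return []  # placeholder: a real retriever would query the image corpus


def verify_identity(subject: str, candidate: str) -> bool:
    """Stage 3 (stub): check identity consistency under contextual variation."""
    return False  # placeholder: a real verifier would use a learned prior


def build_pairs(videos: Iterable[str], prompts: Iterable[str],
                image_corpus: Iterable[str]) -> List[SubjectPair]:
    """Assemble cross-pair (reference image, target video) training samples."""
    pairs: List[SubjectPair] = []
    for video, prompt in zip(videos, prompts):
        for subject in detect_subjects(video, prompt):
            for candidate in retrieve_cross_context(subject, image_corpus):
                if verify_identity(subject, candidate):
                    pairs.append(SubjectPair(candidate, video, subject))
                    break  # keep one verified reference per subject
    return pairs
```

Each stage is written as a stub so the example runs as-is; in practice the detector, retriever, and verifier would be the components the paper describes, operating over the 53M-video and 3B-image corpora.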