Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset

Abstract

Subject-to-video generation has witnessed substantial progress in recent years. However, existing models still face significant challenges in faithfully following textual instructions. This limitation, commonly known as the "copy-paste problem", arises from the widely used in-pair training paradigm, which samples reference images from the same scene as the target video and thereby entangles subject identity with background and contextual attributes. To address this issue, we introduce Phantom-Data, the first general-purpose cross-pair subject-to-video consistency dataset, containing approximately one million identity-consistent pairs across diverse categories. Our dataset is constructed via a three-stage pipeline: (1) a general and input-aligned subject detection module, (2) large-scale cross-context subject retrieval from more than 53 million videos and 3 billion images, and (3) prior-guided identity verification to ensure visual consistency under contextual variation. Comprehensive experiments show that training with Phantom-Data significantly improves prompt alignment and visual quality while preserving identity consistency on par with in-pair baselines.
