MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data

Abstract

Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet the performance of current models degrades severely as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images. The samples are systematically organized across four complementary dimensions (Customization, Illustration, Spatial reasoning, and Temporal dynamics) to provide comprehensive coverage of the multi-reference generation space. Recognizing that standardized evaluation protocols are likewise missing, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.
