Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

Abstract

Although subject-driven generation has been extensively explored in image generation due to its wide range of applications, it still faces challenges in data scalability and subject expansibility. First, moving from curating single-subject datasets to multi-subject ones, and then scaling them, is particularly difficult. Second, most recent methods center on single-subject generation, which makes them hard to apply to multi-subject scenarios. In this study, we propose a highly consistent data synthesis pipeline to tackle the data scalability challenge. The pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers to generate high-consistency multi-subject paired data. Additionally, we introduce UNO, a multi-image conditioned subject-to-image model that is iteratively trained from a text-to-image model and consists of progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method achieves high consistency while ensuring controllability in both single-subject and multi-subject driven generation.

