When Do Diffusion Models Learn to Generate Multiple Objects?

Abstract

Text-to-image diffusion models achieve high visual fidelity, yet they remain unreliable at multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training but the data distribution may be imbalanced, and (2) compositional generalization, where specific combinations of concepts are systematically held out of the training data. To study these regimes, we introduce MOSAIC (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on MOSAIC, we find that scene complexity, rather than concept imbalance, plays the dominant role, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
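The two regimes named in the abstract can be illustrated with a toy data split. The following is only a minimal sketch under stated assumptions: the concept vocabulary (`COLORS`, `SHAPES`), the helper names `sample_imbalanced` and `compositional_split`, and the sampling weights are all hypothetical and are not taken from MOSAIC, whose actual generation procedure is not described here.

```python
import itertools
import random

# Hypothetical toy vocabulary; the abstract does not list MOSAIC's actual concepts.
COLORS = ["red", "green", "blue"]
SHAPES = ["cube", "sphere", "cone"]


def sample_imbalanced(colors, shapes, weights, n, seed=0):
    """Concept generalization (toy version): every pair is allowed,
    but colors are drawn with imbalanced weights."""
    rng = random.Random(seed)
    return [
        (rng.choices(colors, weights=weights)[0], rng.choice(shapes))
        for _ in range(n)
    ]


def compositional_split(colors, shapes, num_held_out, seed=0):
    """Compositional generalization (toy version): hold out (color, shape)
    pairs while keeping every individual concept present in training."""
    rng = random.Random(seed)
    pairs = list(itertools.product(colors, shapes))
    rng.shuffle(pairs)
    train, held_out = list(pairs), []
    for pair in pairs:
        if len(held_out) == num_held_out:
            break
        remaining = [p for p in train if p != pair]
        # Hold out a pair only if its color and its shape still occur in training.
        color_seen = any(p[0] == pair[0] for p in remaining)
        shape_seen = any(p[1] == pair[1] for p in remaining)
        if color_seen and shape_seen:
            train = remaining
            held_out.append(pair)
    return train, held_out


imbalanced = sample_imbalanced(COLORS, SHAPES, weights=[8, 1, 1], n=200)
train_pairs, held_out_pairs = compositional_split(COLORS, SHAPES, num_held_out=3)
```

In this sketch, raising `num_held_out` removes more combinations from training, which mirrors the setting in which the abstract reports that compositional generalization collapses.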

