When Do Diffusion Models Learn to Generate Multiple Objects?
April 30, 2026
Authors: Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach
cs.AI
Abstract
Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
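To make the second regime concrete, here is a minimal sketch of how a compositional-generalization split over concept combinations might be constructed. The vocabularies, function name, and holdout scheme are hypothetical illustrations, not the actual mosaic pipeline: the key property is that every individual concept still appears in training while specific combinations are systematically excluded.

```python
from itertools import product
import random

# Hypothetical concept vocabularies for illustration only
# (not the attribute sets used by the mosaic framework).
OBJECTS = ["cube", "sphere", "cylinder"]
COLORS = ["red", "green", "blue"]

def split_combinations(holdout_fraction: float, seed: int = 0):
    """Partition all (color, object) pairs into train / held-out sets.

    In the compositional-generalization regime, each individual
    concept is still observed during training, but specific
    combinations are excluded and evaluated only at test time.
    """
    combos = list(product(COLORS, OBJECTS))
    rng = random.Random(seed)
    rng.shuffle(combos)
    n_holdout = int(len(combos) * holdout_fraction)
    held_out = set(combos[:n_holdout])
    train = [c for c in combos if c not in held_out]
    # Sanity check: every individual concept must still occur in
    # training; otherwise the split would test concept generalization
    # rather than compositional generalization.
    assert {c for c, _ in train} == set(COLORS)
    assert {o for _, o in train} == set(OBJECTS)
    return train, held_out

train, held_out = split_combinations(holdout_fraction=0.3)
print(f"{len(train)} train combinations, {len(held_out)} held out")
```

Sweeping `holdout_fraction` upward is the knob the abstract refers to when it notes that compositional generalization collapses as more combinations are withheld; the assertion marks the boundary beyond which the split stops being purely compositional.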