When Do Diffusion Models Learn to Generate Multiple Objects?
April 30, 2026
Authors: Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach
cs.AI
Abstract
Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.