UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
May 18, 2023
Authors: Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu
cs.AI
Abstract
Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
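
To make the task-aware HyperNet idea concrete, below is a minimal, illustrative PyTorch sketch of one plausible reading of the mechanism: a small network maps a task ID (one of the nine C2I tasks, e.g. edges or depth) to channel-wise scales that modulate the blocks of a conditioning branch feeding a shared backbone. The names here (TaskAwareHyperNet, ConditionBlock) are assumptions for illustration, not the paper's released implementation, and the sketch omits the pretrained Stable Diffusion backbone that the actual model modulates.

```python
# Illustrative sketch only: a task-aware HyperNet producing per-task
# modulation vectors for a conditioning branch. Class names are
# hypothetical, not taken from the UniControl codebase.
import torch
import torch.nn as nn

class TaskAwareHyperNet(nn.Module):
    """Maps a task ID to one channel-wise scale vector per block."""
    def __init__(self, num_tasks: int, task_dim: int, block_channels: list):
        super().__init__()
        self.task_embedding = nn.Embedding(num_tasks, task_dim)
        # One small MLP head per conditioning block.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(task_dim, task_dim), nn.SiLU(),
                          nn.Linear(task_dim, c))
            for c in block_channels
        ])

    def forward(self, task_id: torch.Tensor):
        e = self.task_embedding(task_id)         # (B, task_dim)
        return [head(e) for head in self.heads]  # list of (B, C_i)

class ConditionBlock(nn.Module):
    """A toy conditioning block whose output the HyperNet modulates."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x, scale):
        h = self.act(self.conv(x))
        # Channel-wise modulation: the task decides how strongly each
        # feature channel contributes to the shared backbone.
        return h * scale[:, :, None, None]

# Usage: encode a 3-channel visual condition (e.g. an edge map) as task 0.
blocks = nn.ModuleList([ConditionBlock(3, 64), ConditionBlock(64, 128)])
hypernet = TaskAwareHyperNet(num_tasks=9, task_dim=32,
                             block_channels=[64, 128])

cond = torch.randn(2, 3, 64, 64)          # batch of visual conditions
scales = hypernet(torch.tensor([0, 0]))   # task 0 for both samples
h = cond
for block, s in zip(blocks, scales):
    h = block(h, s)
print(h.shape)  # torch.Size([2, 128, 64, 64]), features for the backbone
```

The design intuition this sketch captures is why one set of conditioning weights can serve many tasks: the convolutional blocks are shared across all C2I tasks, while the cheap task-conditioned scales steer which features dominate for a given condition type, which is also what makes generalization to unseen task embeddings plausible.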