UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
May 18, 2023
Authors: Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu
cs.AI
Abstract
Machine autonomy and human control often represent divergent objectives in
the design of interactive AI systems. Visual generative foundation models
such as Stable Diffusion show promise in navigating these goals, especially
when prompted with arbitrary language. However, they often
fall short in generating images with spatial, structural, or geometric
controls. The integration of such controls, which can accommodate various
visual conditions in a single unified model, remains an unaddressed challenge.
In response, we introduce UniControl, a new generative foundation model that
consolidates a wide array of controllable condition-to-image (C2I) tasks within
a singular framework, while still allowing for arbitrary language prompts.
UniControl enables pixel-level-precise image generation, where visual
conditions primarily influence the generated structures and language prompts
guide the style and context. To equip UniControl with the capacity to handle
diverse visual conditions, we augment pretrained text-to-image diffusion models
and introduce a task-aware HyperNet to modulate the diffusion models, enabling
simultaneous adaptation to different C2I tasks. Trained on nine unique
C2I tasks, UniControl demonstrates impressive zero-shot generation abilities
with unseen visual conditions. Experimental results show that UniControl often
surpasses the performance of single-task-controlled methods of comparable model
sizes. This control versatility positions UniControl as a significant
advancement in the realm of controllable visual generation.
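
To make the HyperNet mechanism concrete, the following is a minimal PyTorch sketch of one plausible reading of the abstract: a small task-conditioned network emits per-layer scaling vectors that modulate a condition encoder's features before they would be injected into a frozen diffusion UNet. All names (TaskAwareHyperNet, ModulatedConditionEncoder), layer sizes, and the FiLM-style channel scaling are illustrative assumptions, not UniControl's actual implementation.

    # Minimal sketch of a task-aware HyperNet modulating a condition encoder.
    # Module names, sizes, and the FiLM-style scaling are assumptions for
    # illustration; they are not taken from the UniControl codebase.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TaskAwareHyperNet(nn.Module):
        """Maps a learned task embedding to one modulation vector per layer."""
        def __init__(self, task_dim: int, layer_dims: list[int]):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Sequential(nn.Linear(task_dim, 128), nn.SiLU(), nn.Linear(128, d))
                for d in layer_dims
            )

        def forward(self, task_embed: torch.Tensor) -> list[torch.Tensor]:
            # Each head emits a (batch, layer_dim) scaling vector.
            return [head(task_embed) for head in self.heads]

    class ModulatedConditionEncoder(nn.Module):
        """Toy condition-to-feature encoder; the HyperNet's vectors rescale
        its activations channel-wise (FiLM-style)."""
        def __init__(self, channels: list[int]):
            super().__init__()
            dims = [3] + channels
            self.convs = nn.ModuleList(
                nn.Conv2d(dims[i], dims[i + 1], 3, stride=2, padding=1)
                for i in range(len(channels))
            )

        def forward(self, cond_image, modulations):
            h, feats = cond_image, []
            for conv, mod in zip(self.convs, modulations):
                h = F.silu(conv(h))
                h = h * mod[:, :, None, None]  # broadcast (B, C) over H, W
                feats.append(h)
            return feats  # multi-scale features for the frozen diffusion UNet

    # Usage: one embedding per C2I task (e.g., edge maps, depth, segmentation).
    task_embed = torch.randn(1, 64)                  # hypothetical task code
    hypernet = TaskAwareHyperNet(64, [32, 64, 128])
    encoder = ModulatedConditionEncoder([32, 64, 128])
    cond = torch.randn(1, 3, 64, 64)                 # visual condition map
    feats = encoder(cond, hypernet(task_embed))
    print([tuple(f.shape) for f in feats])           # multi-scale outputs

Under this reading, switching C2I tasks only swaps the task embedding, while a single shared encoder and HyperNet handle all conditions, which is consistent with the zero-shot claims for unseen visual conditions.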