UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Abstract

Machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. Integrating such controls so that a single unified model can accommodate various visual conditions remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a single framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes. This control versatility positions UniControl as a significant advancement in the realm of controllable visual generation.
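The abstract does not specify how the task-aware HyperNet modulates the diffusion model, so the following is only a minimal sketch of one generic form of hypernetwork-style conditioning: a small linear map turns a task embedding into per-channel scales, which are then applied to shared features. Everything here is an illustrative assumption rather than the authors' implementation: the names `make_hypernet`, `channel_scales`, `modulate`, the one-hot task embeddings in `TASKS`, and the choice of per-channel scaling itself are all hypothetical.

```python
import random


def make_hypernet(task_dim, channels, seed=0):
    """Build a tiny linear 'hypernet' (hypothetical): one weight row per channel."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(task_dim)]
            for _ in range(channels)]


def channel_scales(hypernet, task_embedding):
    """Map a task embedding to one scale per channel.

    The scale is 1 + w . e, so a zero embedding leaves features unchanged.
    """
    return [1.0 + sum(w * e for w, e in zip(row, task_embedding))
            for row in hypernet]


def modulate(features, scales):
    """Scale each channel (a flat list of activations) by its task-specific factor."""
    return [[s * x for x in channel] for channel, s in zip(features, scales)]


# Hypothetical one-hot task embeddings, for illustration only.
TASKS = {
    "edge": [1.0, 0.0, 0.0],
    "depth": [0.0, 1.0, 0.0],
    "pose": [0.0, 0.0, 1.0],
}

hypernet = make_hypernet(task_dim=3, channels=2)
features = [[1.0, 2.0], [3.0, 4.0]]  # two channels of shared features

# The same shared features are modulated differently depending on the task.
edge_out = modulate(features, channel_scales(hypernet, TASKS["edge"]))
depth_out = modulate(features, channel_scales(hypernet, TASKS["depth"]))
```

The point of the sketch is the design idea named in the abstract: one shared model serves several C2I tasks, and a small task-conditioned module adjusts its behavior per task instead of training a separate control network for each condition type.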
