

ControlAR: Controllable Image Generation with Autoregressive Models

October 3, 2024
作者: Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
cs.AI

Abstract

Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. A natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens; however, this still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR exploits the conditional decoding method to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding not only significantly strengthens the control capability of AR models but also maintains the model's efficiency. Furthermore, the proposed ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of the proposed ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Furthermore, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at https://github.com/hustvl/ControlAR.
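
To make the abstract's conditional-decoding idea concrete, below is a minimal sketch of how a lightweight control encoder and per-token fusion could look. All module names, shapes, the patch-embedding encoder, and the additive fusion are illustrative assumptions based only on the description above, not the official ControlAR implementation from the linked repository.

```python
# Sketch: a control image (e.g., canny edges) is encoded into control tokens,
# and each control token is fused with the corresponding image-token embedding
# (additively, like a positional encoding) before predicting the next image token.
import torch
import torch.nn as nn

class LightweightControlEncoder(nn.Module):
    """Hypothetical control encoder: patchify the control image into tokens."""
    def __init__(self, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, control_image):               # (B, 3, H, W)
        tokens = self.proj(control_image)            # (B, dim, H/ps, W/ps)
        return tokens.flatten(2).transpose(1, 2)     # (B, N, dim) control tokens

class ConditionalDecoderStub(nn.Module):
    """Toy AR decoder: fuses control tokens with image-token embeddings per position."""
    def __init__(self, vocab_size=16384, dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, image_tokens, control_tokens):
        x = self.tok_emb(image_tokens)               # (B, T, dim)
        # Per-token fusion: add the control token aligned with each position,
        # analogous to how positional encodings are injected.
        x = x + control_tokens[:, : x.size(1)]
        # Causal mask keeps the decoding autoregressive.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.backbone(x, mask=mask))  # next-token logits (B, T, vocab)

# Usage sketch
encoder, decoder = LightweightControlEncoder(), ConditionalDecoderStub()
control = torch.randn(1, 3, 256, 256)                # e.g., a canny-edge map
image_tokens = torch.randint(0, 16384, (1, 64))      # previously decoded image tokens
logits = decoder(image_tokens, encoder(control))     # conditioned next-token prediction
```

The key contrast with prefilling is that the control tokens never lengthen the decoded sequence here; they condition each step through fusion, which is what the abstract credits for preserving the AR model's efficiency.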

