ControlAR: Controllable Image Generation with Autoregressive Models
October 3, 2024
Authors: Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Autoregressive (AR) models have reformulated image generation as next-token
prediction, demonstrating remarkable potential and emerging as strong
competitors to diffusion models. However, control-to-image generation, akin to
ControlNet, remains largely unexplored within AR models. Although a natural
approach, inspired by advancements in Large Language Models, is to tokenize
control images into tokens and prefill them into the autoregressive model
before decoding image tokens, it still falls short in generation quality
compared to ControlNet and suffers from inefficiency. To this end, we introduce
ControlAR, an efficient and effective framework for integrating spatial
controls into autoregressive image generation models. Firstly, we explore
control encoding for AR models and propose a lightweight control encoder to
transform spatial inputs (e.g., Canny edges or depth maps) into control tokens.
Then ControlAR exploits the conditional decoding method to generate the next
image token conditioned on the per-token fusion between control and image
tokens, similar to positional encodings. Compared to prefilling tokens,
conditional decoding significantly strengthens the control capability of AR
models while also maintaining the model's efficiency. Furthermore, the proposed
ControlAR surprisingly empowers AR models with arbitrary-resolution image
generation via conditional decoding and specific controls. Extensive
experiments demonstrate the controllability of the proposed ControlAR for
autoregressive control-to-image generation across diverse inputs, including
edges, depth maps, and segmentation masks. Furthermore, both quantitative and
qualitative results indicate that ControlAR surpasses previous state-of-the-art
controllable diffusion models, e.g., ControlNet++. Code, models, and demo will
soon be available at https://github.com/hustvl/ControlAR.
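To make the mechanism described in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the stated idea: a lightweight control encoder turns a spatial control image into control tokens, and during decoding each control token is fused with the corresponding image-token embedding (here simply by addition, akin to a positional encoding) rather than being prefilled as prompt tokens. This is an illustrative sketch under stated assumptions, not the authors' implementation; the names (ControlEncoder, ARImageDecoder, conditional_decode), the additive fusion, the toy model sizes, and greedy sampling are all assumptions for exposition.

```python
import torch
import torch.nn as nn


class ControlEncoder(nn.Module):
    """Assumed lightweight control encoder: patchify a control image
    (e.g., Canny edges or a depth map) into a sequence of control tokens."""

    def __init__(self, in_ch=3, dim=512, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, control_img):                  # (B, C, H, W)
        x = self.proj(control_img)                   # (B, dim, H/patch, W/patch)
        return x.flatten(2).transpose(1, 2)          # (B, N, dim) control tokens


class ARImageDecoder(nn.Module):
    """Stand-in for a decoder-only autoregressive image generator over VQ token ids."""

    def __init__(self, vocab=16384, dim=512, layers=4, heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids, control_tokens):
        B, T = token_ids.shape
        x = self.tok_emb(token_ids)                  # (B, T, dim)
        # Per-token fusion: the input at position t predicts image token t,
        # so it is fused (added, like a positional encoding) with control
        # token t instead of prefilling the control tokens as a prompt.
        x = x + control_tokens[:, :T, :]
        mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.blocks(x, mask=mask)
        return self.head(h)                          # next-token logits


@torch.no_grad()
def conditional_decode(decoder, encoder, control_img, steps, bos_id=0):
    """Greedy conditional decoding: one image token per step, each step
    guided by the control token of the position being generated."""
    ctrl = encoder(control_img)                      # (B, N, dim)
    ids = torch.full((control_img.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(steps):
        logits = decoder(ids, ctrl)[:, -1]           # logits for the next token
        ids = torch.cat([ids, logits.argmax(-1, keepdim=True)], dim=1)
    return ids[:, 1:]                                # generated image token ids


if __name__ == "__main__":
    enc, dec = ControlEncoder(), ARImageDecoder()
    canny = torch.rand(1, 3, 256, 256)               # toy control image
    print(conditional_decode(dec, enc, canny, steps=16).shape)  # torch.Size([1, 16])
```

In this reading, the per-token fusion ties each generated token to the control token at the same spatial position, which is also what would let the number of decoding steps follow the control image's token grid and hence vary with its resolution.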