ControlAR: Controllable Image Generation with Autoregressive Models

Abstract

Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. A natural approach, inspired by advances in Large Language Models, is to tokenize control images and prefill the resulting tokens into the autoregressive model before decoding image tokens; however, this approach still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. First, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. Then ControlAR uses conditional decoding to generate the next image token conditioned on the per-token fusion of control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining the model's efficiency. Furthermore, ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depth maps, and segmentation masks. Moreover, both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and demo will soon be available at https://github.com/hustvl/ControlAR.
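
The contrast between prefilling and per-token fusion can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the fusion operator or how control and image positions are aligned, so the additive fusion at matching positions, the function names `prefill_sequence` and `fuse_per_token`, and the embedding values below are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def prefill_sequence(control: List[Vector], image: List[Vector]) -> List[Vector]:
    """Prefilling: control tokens are placed in front of the image tokens,
    so the model has to process both sequences end to end."""
    return control + image


def fuse_per_token(control: List[Vector], image: List[Vector]) -> List[Vector]:
    """Per-token fusion: combine the control embedding at each position with
    the image embedding at the same position.
    Element-wise addition is an assumption, chosen by analogy with how
    additive positional encodings are applied."""
    if len(control) != len(image):
        raise ValueError("control and image sequences must have equal length")
    return [
        [c + x for c, x in zip(c_vec, x_vec)]
        for c_vec, x_vec in zip(control, image)
    ]


# Toy embeddings: 4 positions, 2 dimensions each (hypothetical values).
control_tokens = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [0.0, 0.0]]
image_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

prefilled = prefill_sequence(control_tokens, image_tokens)
fused = fuse_per_token(control_tokens, image_tokens)

print(len(prefilled))  # 8: sequence length doubles with prefilling
print(len(fused))      # 4: sequence length is unchanged with fusion
print(fused[0])        # [1.5, 2.0]
```

In this sketch the fused sequence keeps the original length, whereas prefilling doubles it. That difference is one plausible reading of the efficiency claim above, since attention cost in a transformer grows with sequence length.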
