Autoregressive Image Generation with Randomized Parallel Decoding

Abstract

We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation. It addresses the inherent limitations of conventional raster-order approaches, whose sequential, predefined token generation order hinders inference efficiency and zero-shot generalization. Our key insight is that effective random-order modeling requires explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By incorporating this guidance directly into the causal attention mechanism, our approach enables fully random-order training and generation and eliminates the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. It also supports parallel inference by processing multiple queries concurrently against a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps. Compared to representative recent autoregressive models of similar scale, it increases throughput by more than 20-fold and reduces memory consumption by more than 75%.
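The guided decoding idea described in the abstract can be illustrated with a small NumPy sketch. This is not the ARPG implementation: it is a toy, single-head, randomly initialised example, and every name in it (`guided_attention`, `generate`, `pos_embed`, `tok_embed`, `w_q`, `w_k`, `w_v`, `w_out`, `tokens_per_step`) is a hypothetical choice made for illustration. It assumes that queries are built only from the embeddings of the target positions, that keys and values are built from already generated tokens, and that several target positions are decoded per step against one shared cache.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(queries, keys, values):
    # queries: (m, d) positional guidance for m target positions.
    # keys, values: (n, d) content of the n tokens generated so far.
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)  # (m, n)
    return weights @ values                               # (m, d)

def generate(num_tokens=16, dim=8, vocab=32, tokens_per_step=4, seed=0):
    rng = np.random.default_rng(seed)
    # Toy parameters, randomly initialised (assumption: no trained weights).
    pos_embed = rng.normal(size=(num_tokens, dim))
    tok_embed = rng.normal(size=(vocab, dim))
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
    w_out = rng.normal(size=(dim, vocab))

    order = rng.permutation(num_tokens)   # random generation order
    start = rng.normal(size=(1, dim))     # stand-in for a class/start token
    k_cache = start @ w_k                 # shared KV cache, grows over time
    v_cache = start @ w_v
    image = np.full(num_tokens, -1)

    steps = 0
    for i in range(0, num_tokens, tokens_per_step):
        targets = order[i:i + tokens_per_step]
        # Queries carry only *where* to predict; they hold no content.
        queries = pos_embed[targets] @ w_q
        # All queries of this step read the same cache in one call.
        hidden = guided_attention(queries, k_cache, v_cache)
        new_tokens = (hidden @ w_out).argmax(axis=-1)
        image[targets] = new_tokens
        # Generated content (token plus its position) extends the cache,
        # so later queries attend to it causally.
        content = tok_embed[new_tokens] + pos_embed[targets]
        k_cache = np.vstack([k_cache, content @ w_k])
        v_cache = np.vstack([v_cache, content @ w_v])
        steps += 1
    return image, steps, k_cache.shape[0]
```

In this sketch, queries within one step never attend to each other, only to the cache, which is what lets them be processed in a single batched call. Zero-shot inpainting would correspond to pre-filling the cache with the known tokens and querying only the missing positions, again under the assumptions stated above.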
