SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

Abstract

This work presents SimpleAR, a vanilla autoregressive visual generation framework that requires no complex architecture modifications. Through careful exploration of training and inference optimization, we demonstrate that: 1) with only 0.5B parameters, our model can generate 1024x1024 images with high fidelity and achieve competitive results on challenging text-to-image benchmarks, e.g., 0.59 on GenEval and 79.66 on DPG; 2) both supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) training lead to significant improvements in generation aesthetics and prompt alignment; and 3) when optimized with inference acceleration techniques such as vLLM, the time for SimpleAR to generate a 1024x1024 image can be reduced to around 14 seconds. By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation and encourage more participation in this research field. Code is available at https://github.com/wdrink/SimpleAR.
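For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage computation that gives the method its name, as it is commonly formulated: several outputs are sampled for the same prompt, each is scored, and every score is normalized against the mean and standard deviation of its own group. The abstract does not specify the reward function or the group size, so the reward values and the `group_relative_advantages` helper below are hypothetical and only illustrate the normalization step, not the SimpleAR training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std deviation.

    `rewards` holds one scalar score per sample generated for the same prompt.
    `eps` guards against division by zero when all rewards are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four images sampled from one prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and are reinforced; samples below it receive a negative one. Because the baseline comes from the group itself, no separate value network is needed.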
