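NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

Abstract

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By using a unified visual representation within a single autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, enabling image editing, interleaved content generation, and video generation. Text is strictly sequential, whereas images are inherently hierarchical. Motivated by this difference, we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods and allows a 1024x1024 image to be generated in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation with a robust training recipe, and we introduce a prefix-tuning strategy for reinforcement learning. Experiments show that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.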
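
To illustrate why next-scale prediction can need far fewer sequential decoding steps than raster-scan decoding, the following is a minimal counting sketch. It is not the NextFlow implementation. The 16x tokenizer downsampling factor and the scale schedule `SCALE_SIDES` are hypothetical values chosen for illustration and do not come from the abstract. The sketch counts sequential decoding steps only and says nothing about wall-clock time.

```python
# Rough step-count comparison between raster-scan and next-scale decoding.
# All constants below are illustrative assumptions, not values from the paper.

def raster_scan_steps(grid_side: int) -> int:
    """Sequential steps when exactly one image token is emitted per step."""
    return grid_side * grid_side


def next_scale_steps(scale_sides: list[int]) -> int:
    """Sequential steps when each scale's whole token map is emitted in one step."""
    return len(scale_sides)


def total_tokens(scale_sides: list[int]) -> int:
    """Total tokens produced across all scales of the coarse-to-fine pyramid."""
    return sum(side * side for side in scale_sides)


# Assumption: a tokenizer with 16x spatial downsampling, so a 1024x1024
# image maps to a 64x64 token grid at the finest scale.
GRID_SIDE = 1024 // 16

# Assumption: a hypothetical coarse-to-fine schedule that doubles each side.
SCALE_SIDES = [1, 2, 4, 8, 16, 32, 64]

if __name__ == "__main__":
    print("raster-scan sequential steps:", raster_scan_steps(GRID_SIDE))
    print("next-scale sequential steps:", next_scale_steps(SCALE_SIDES))
    print("next-scale total tokens:", total_tokens(SCALE_SIDES))
```

Under these assumptions the multi-scale pyramid produces more tokens in total than the single finest grid, but the number of sequential steps drops from one per token to one per scale, which is the property the abstract credits for the speedup.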
