X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
July 29, 2025
Authors: Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang
cs.AI
Abstract
Numerous efforts have been made to extend the "next token prediction"
paradigm to visual content, aiming to create a unified approach to both image
generation and understanding. Nevertheless, attempts to generate images through
autoregressive modeling with discrete tokens have been plagued by issues such
as low visual fidelity, distorted outputs, and failure to adhere to complex
instructions when rendering intricate details. These shortcomings are likely
attributable to cumulative errors during autoregressive inference or to
information loss incurred during the discretization process. Probably owing to
this challenge, recent research has increasingly shifted toward jointly training
image generation with diffusion objectives and language generation with
autoregressive objectives, moving away from unified modeling approaches. In
this work, we demonstrate that reinforcement learning can effectively mitigate
artifacts and substantially enhance the generation quality of a discrete
autoregressive modeling method, thereby enabling seamless integration of image
and language generation. Our framework, termed X-Omni, comprises a semantic
image tokenizer, a unified autoregressive model for both language and images,
and an offline diffusion decoder for image generation. X-Omni achieves
state-of-the-art performance in image generation tasks using a 7B language
model, producing images with high aesthetic quality while exhibiting strong
capabilities in following instructions and rendering long texts.
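To make the three-stage design concrete, the following is a minimal toy sketch of the pipeline the abstract describes: a semantic image tokenizer that produces discrete token ids, a unified autoregressive model that emits image tokens after a text prompt, and a frozen diffusion decoder that maps tokens back to pixels. All class names, method signatures, and the stand-in logic inside them are illustrative assumptions for exposition, not the authors' actual implementation or API.

```python
# Hypothetical sketch of an X-Omni-style pipeline (names and logic are
# illustrative assumptions, not the paper's real components).

class SemanticTokenizer:
    """Maps an image into a sequence of discrete semantic token ids."""
    def __init__(self, codebook_size=16384):
        self.codebook_size = codebook_size

    def encode(self, image):
        # Toy stand-in: hash each "patch" value into the codebook.
        return [hash(p) % self.codebook_size for p in image]


class UnifiedARModel:
    """Autoregressively predicts the next token, for text or image tokens."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def generate(self, prompt_tokens, num_image_tokens):
        # Toy stand-in for sampling: a deterministic next-token rule.
        out = list(prompt_tokens)
        for _ in range(num_image_tokens):
            out.append((out[-1] * 31 + 7) % self.vocab_size)
        return out[len(prompt_tokens):]


class DiffusionDecoder:
    """Decodes discrete image tokens back to pixels (offline, frozen)."""
    def decode(self, image_tokens):
        # Toy "pixels" in [0, 1).
        return [t / 16384 for t in image_tokens]


# End-to-end toy run: text prompt -> image tokens -> decoded "image".
model = UnifiedARModel(vocab_size=16384)
decoder = DiffusionDecoder()

prompt = [1, 2, 3]  # pretend text token ids
img_tokens = model.generate(prompt, num_image_tokens=8)
pixels = decoder.decode(img_tokens)
assert len(img_tokens) == 8 and all(0 <= p < 1 for p in pixels)
```

Because the decoder is offline and frozen, only the autoregressive model over discrete tokens would be updated during reinforcement learning; the tokenizer and decoder stay fixed, which is what lets image and language generation share a single next-token interface.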