

X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

July 29, 2025
Authors: Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang
cs.AI

Abstract

Numerous efforts have been made to extend the "next-token prediction" paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or to information loss incurred during discretization. Likely because of this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially improve the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. Using a 7B language model, X-Omni achieves state-of-the-art performance in image generation tasks, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.