X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again
July 29, 2025
Authors: Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, Jie Jiang
cs.AI
Abstract
Numerous efforts have been made to extend the "next token prediction" paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by issues such as low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or to information loss incurred during discretization. Probably owing to this challenge, recent research has increasingly shifted toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives, moving away from unified modeling approaches. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially enhance the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. X-Omni achieves state-of-the-art performance in image generation tasks using a 7B language model, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
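To make the three-stage pipeline named in the abstract (semantic image tokenizer, unified autoregressive model, offline diffusion decoder) easier to picture, here is a minimal structural sketch. All class and method names (SemanticTokenizer, UnifiedARModel, DiffusionDecoder, text_to_image) are hypothetical placeholders chosen for illustration; they are not X-Omni's released API, and the bodies are stubs that only show how data would flow between the components.

```python
# Structural sketch of the pipeline described in the abstract.
# All names are hypothetical placeholders, not X-Omni's actual interface.

from dataclasses import dataclass
from typing import List


@dataclass
class SemanticTokenizer:
    """Maps images to and from a discrete semantic token vocabulary (assumed)."""
    vocab_size: int = 16384

    def encode(self, image) -> List[int]:
        raise NotImplementedError  # image -> discrete semantic tokens

    def to_features(self, tokens: List[int]):
        raise NotImplementedError  # tokens -> features consumed by the decoder


class UnifiedARModel:
    """A single autoregressive transformer over text and image tokens (abstract states a 7B LM)."""

    def generate(self, prompt_tokens: List[int], num_image_tokens: int) -> List[int]:
        # Next-token prediction over a shared text/image vocabulary.
        # Per the abstract, reinforcement learning fine-tunes this policy to
        # reduce artifacts and improve instruction following and text rendering.
        raise NotImplementedError


class DiffusionDecoder:
    """Offline diffusion decoder that renders pixels from semantic token features (assumed)."""

    def render(self, token_features):
        raise NotImplementedError  # features -> RGB image


def text_to_image(prompt_tokens: List[int],
                  ar_model: UnifiedARModel,
                  tokenizer: SemanticTokenizer,
                  decoder: DiffusionDecoder):
    """Illustrative text-to-image flow: sample discrete image tokens
    autoregressively, then synthesize pixels with the diffusion decoder."""
    image_tokens = ar_model.generate(prompt_tokens, num_image_tokens=1024)
    features = tokenizer.to_features(image_tokens)
    return decoder.render(features)
```

The point of the sketch is the division of labor: the autoregressive model only ever deals with discrete tokens (keeping language and image generation under one next-token objective), while visual fidelity is delegated to the separate diffusion decoder at the end of the pipeline.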