X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again

Abstract

Numerous efforts have been made to extend the "next-token prediction" paradigm to visual content, aiming to create a unified approach for both image generation and understanding. Nevertheless, attempts to generate images through autoregressive modeling with discrete tokens have been plagued by low visual fidelity, distorted outputs, and failure to adhere to complex instructions when rendering intricate details. These shortcomings are likely attributable to cumulative errors during autoregressive inference or to information loss incurred during the discretization process. Likely because of these challenges, recent research has increasingly moved away from unified modeling approaches and toward jointly training image generation with diffusion objectives and language generation with autoregressive objectives. In this work, we demonstrate that reinforcement learning can effectively mitigate artifacts and substantially improve the generation quality of a discrete autoregressive modeling method, thereby enabling seamless integration of image and language generation. Our framework, termed X-Omni, comprises a semantic image tokenizer, a unified autoregressive model for both language and images, and an offline diffusion decoder for image generation. Using a 7B language model, X-Omni achieves state-of-the-art performance in image generation tasks, producing images with high aesthetic quality while exhibiting strong capabilities in following instructions and rendering long texts.
