RankE: End-to-End Post-Training for Discrete Text-to-Image Generation via Decoder Co-Evolution

Abstract

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity-alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.