Representation Forcing for Bottleneck-Free Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, which imposes a structural bottleneck. Naively removing the VAE introduces a quality gap, because the model must then learn both high-level structure and low-level detail from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.
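The ordering described in the abstract (representation tokens predicted before pixels, then kept in context during pixel diffusion) can be pictured with a small sketch. This is an illustration only, not the paper's implementation: the segment names, the token counts, the helper functions `build_layout` and `build_mask`, and above all the attention pattern (causal for text and representation tokens, full visibility for pixel patches) are assumptions made here to keep the example concrete.

```python
from typing import List


def build_layout(n_text: int, n_rep: int, n_pix: int) -> List[str]:
    """Hypothetical sequence layout: prompt text, then representation
    tokens, then pixel patches, all processed by one backbone."""
    return ["text"] * n_text + ["rep"] * n_rep + ["pix"] * n_pix


def build_mask(layout: List[str]) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    n = len(layout)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if layout[q] == "pix":
                # Assumption: pixel patches are denoised jointly, so each one
                # sees the text, the representation tokens, and the other patches.
                mask[q][k] = True
            else:
                # Text and representation tokens are predicted autoregressively,
                # so they only see themselves and earlier positions. Pixel
                # patches come last in the layout, so they are never visible here.
                mask[q][k] = k <= q
    return mask


# Toy sizes chosen for readability, not taken from the paper.
layout = build_layout(n_text=2, n_rep=3, n_pix=4)
mask = build_mask(layout)
```

Under these assumptions, every pixel position can attend to all representation tokens, which is one way the predicted representations could guide pixel diffusion, while no representation token can attend to a pixel position.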