Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Abstract

Unifying visual comprehension and generation within a single model is an active frontier in multimodal modeling. However, the two tasks demand mismatched decoding regimes and visual representations, which makes joint optimization within a shared feature space non-trivial. In this work, we present Cheers, a unified multimodal model (UMM) that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating that unified multimodal modeling can be both effective and efficient. We will release all code and data for future research.
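The abstract mentions "semantically gated detail residuals" without giving a formula. As a rough illustration only, the sketch below shows one common form such a gate could take: a per-token sigmoid gate computed from the semantic features scales a detail residual before it is added back. The function name, the linear gate, and the parameters `gate_weights` and `gate_bias` are all hypothetical and are not taken from the Cheers paper or its code.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_detail_residual(semantic, detail, gate_weights, gate_bias=0.0):
    """Return semantic + g * detail for each token.

    semantic, detail: lists of tokens; each token is a list of floats
        of the same length (hypothetical feature vectors).
    gate_weights, gate_bias: hypothetical learned gate parameters.
        The gate g = sigmoid(w . semantic_token + b) is an assumed form,
        not the authors' published design.
    """
    refined = []
    for sem_tok, det_tok in zip(semantic, detail):
        # The gate depends only on the semantic token, so semantics decide
        # how much patch-level detail is injected.
        logit = sum(w * s for w, s in zip(gate_weights, sem_tok)) + gate_bias
        gate = sigmoid(logit)
        refined.append([s + gate * d for s, d in zip(sem_tok, det_tok)])
    return refined


# With zero gate weights and zero bias the gate is exactly 0.5,
# so half of the detail residual is added to each semantic feature.
print(gated_detail_residual([[1.0, 2.0]], [[2.0, -2.0]], [0.0, 0.0]))
# prints [[2.0, 1.0]]
```

When the gate saturates near zero, the output reduces to the semantic features alone. This matches the intuition in the abstract that semantics are decoded first and details are added only as a refinement.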

