視覚を方言として：テキスト整合表現による視覚理解と生成の統合

要旨

本論文は、視覚的理解と生成を共有する離散的な意味表現に統合しようとするマルチモーダルフレームワークを提案する。その中核となるのは、テキストアラインドトークナイザ（TA-Tok）であり、大規模言語モデル（LLM）の語彙から投影されたテキストアラインドコードブックを使用して画像を離散トークンに変換する。視覚とテキストを拡張された語彙を持つ統一された空間に統合することで、我々のマルチモーダルLLM「Tar」は、モダリティ固有の設計を必要とせず、共有インターフェースを通じてクロスモーダルな入力と出力を可能にする。さらに、効率と視覚的詳細のバランスを取るためのスケール適応型エンコーディングとデコーディング、および高忠実度の視覚的出力を生成するための生成的デトークナイザを提案する。多様なデコードニーズに対応するため、高速な自己回帰モデルと拡散ベースのモデルという2つの補完的なデトークナイザを利用する。モダリティ融合を強化するため、高度な事前学習タスクを調査し、視覚的理解と生成の両方で改善を示す。ベンチマークを跨いだ実験により、Tarは既存のマルチモーダルLLM手法に匹敵またはそれを上回り、より速い収束と高いトレーニング効率を達成することが示された。コード、モデル、データはhttps://tar.csuhan.comで公開されている。

English

This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com

視覚を方言として：テキスト整合表現による視覚理解と生成の統合

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

要旨

Support