End-to-End Vision Tokenizer Tuning

Abstract

Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming that the visual tokens generalize well across tasks such as image generation and visual question answering. A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks that require varied representations and semantics. This decoupled paradigm introduces a critical misalignment: loss in vision tokenization can become the representation bottleneck for target tasks. For example, errors in tokenizing the text in an image lead to poor results when recognizing or generating that text. To address this, we propose ETT (End-to-End Vision Tokenizer Tuning), an end-to-end vision tokenizer tuning approach that enables joint optimization of vision tokenization and target autoregressive tasks. Unlike prior autoregressive models, which use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and caption objectives. ETT is simple to implement and can be integrated into existing training pipelines with minimal architecture modifications, without adjusting the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that the proposed end-to-end vision tokenizer tuning yields significant performance gains of 2-6% on multimodal understanding and visual generation tasks compared to frozen-tokenizer baselines, while preserving the original reconstruction capability. We hope this simple and strong method can empower multimodal foundation models beyond image generation and understanding.
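
To make the contrast between discrete indices and codebook embeddings concrete, the following is a minimal toy sketch in plain Python. It is not the authors' implementation: the function names, the toy codebook, and the weighted-sum form of the joint objective (including the weight `alpha`) are all assumptions made for illustration, and no gradient computation is performed.

```python
# Illustrative sketch only: toy sizes, hypothetical names, no training loop.

def nearest_code(feature, codebook):
    """Return (index, embedding) of the codebook entry closest to `feature`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return index, codebook[index]


def tokenize(features, codebook):
    """Quantize each encoder feature; keep both the indices and the embeddings."""
    pairs = [nearest_code(f, codebook) for f in features]
    indices = [i for i, _ in pairs]
    embeddings = [e for _, e in pairs]
    return indices, embeddings


def joint_loss(caption_loss, reconstruction_loss, alpha=0.25):
    """Combine both objectives. The weighted sum and `alpha` are assumptions."""
    return caption_loss + alpha * reconstruction_loss


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy 3-entry codebook
features = [[0.9, 0.1], [0.1, 0.8]]              # toy encoder outputs

indices, embeddings = tokenize(features, codebook)

# Frozen-tokenizer setup: only `indices` reach the language model, so a
# downstream loss has no path back to the codebook entries.
# ETT-style setup (as the abstract describes it): `embeddings` are passed on,
# so the codebook entries lie on the path the caption loss is computed through.
total = joint_loss(caption_loss=2.0, reconstruction_loss=0.4)
```

In a real system the embeddings would be mapped into the language model's input space and trained with automatic differentiation; the sketch only shows which quantity is handed downstream in each setup and how two scalar objectives might be combined.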