Planting a SEED of Vision in Large Language Model

Abstract

We present SEED, an elaborate image tokenizer that gives Large Language Models (LLMs) the emergent ability to SEE and Draw at the same time. Research on image tokenizers had previously reached an impasse: frameworks built on quantized visual tokens lost prominence because of subpar performance and convergence in multimodal comprehension (compared with BLIP-2, etc.) and in generation (compared with Stable Diffusion, etc.). Despite these limitations, we remain confident in this approach's natural capacity to unify visual and textual representations, which facilitates scalable multimodal training with the LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, so that they exhibit an intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and should be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, an off-the-shelf LLM can perform both image-to-text and text-to-image generation by incorporating SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
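
To make two terms in the abstract concrete, the following toy sketch illustrates what "quantized visual tokens" and a "1D causal dependency" mean in general. It is not the SEED implementation: the codebook, the embeddings, and the helper names `quantize` and `causal_mask` are hypothetical and chosen only for illustration. Quantization is shown as a plain nearest-neighbor codebook lookup, and the causal dependency as a lower-triangular attention mask.

```python
# Illustrative sketch only; NOT the SEED implementation.
# All names, sizes, and values below are made-up toy assumptions.

def quantize(embeddings, codebook):
    """Map each continuous embedding to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        for vec in embeddings
    ]


def causal_mask(n):
    """mask[i][j] is True when position i may depend on position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy 2-D codebook with four entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Three toy "image" embeddings, already arranged as a 1D sequence.
embeddings = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

tokens = quantize(embeddings, codebook)   # discrete token ids
mask = causal_mask(len(tokens))           # left-to-right dependency pattern
```

Under this pattern, each token position depends only on itself and earlier positions, which is the same left-to-right structure an autoregressive LLM uses over text tokens.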
