Planting a SEED of Vision in Large Language Model
July 16, 2023
Authors: Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
cs.AI
Abstract
We present SEED, an elaborate image tokenizer that empowers Large Language
Models (LLMs) with the emergent ability to SEE and Draw at the same time.
Research on image tokenizers has previously reached an impasse, as frameworks
employing quantized visual tokens have lost prominence due to subpar
performance and convergence in multimodal comprehension (compared to BLIP-2,
etc.) or generation (compared to Stable Diffusion, etc.). Despite these
limitations, we remain confident in the natural capacity of discrete visual
tokens to unify visual and textual representations, facilitating scalable
multimodal training with the LLM's original recipe. In this study, we identify
two crucial principles for the
architecture and training of SEED that effectively ease subsequent alignment
with LLMs. (1) Image tokens should be independent of 2D physical patch
positions and instead be produced with a 1D causal dependency, exhibiting
intrinsic interdependence that aligns with the left-to-right autoregressive
prediction mechanism in LLMs. (2) Image tokens should capture high-level
semantics consistent with the degree of semantic abstraction in words, and be
optimized for both discriminativeness and reconstruction during the tokenizer
training phase. As a result, the off-the-shelf LLM is able to perform both
image-to-text and text-to-image generation by incorporating our SEED through
efficient LoRA tuning. Comprehensive multimodal pretraining and instruction
tuning, which may yield improved results, are reserved for future
investigation. This version of SEED was trained in 5.7 days using only 64 V100
GPUs and 5M publicly available image-text pairs. Our preliminary study
emphasizes the great potential of discrete visual tokens in versatile
multimodal LLMs and the importance of proper image tokenizers in broader
research.
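
To make principles (1) and (2) above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: learnable queries attend to 2D patch features through a transformer decoder whose self-attention is causally masked, yielding a 1D left-to-right token sequence, and each resulting embedding is snapped to a discrete codebook entry with straight-through gradients. Module names, dimensions, and the codebook size are illustrative assumptions; the discriminative (e.g., image-text contrastive) and reconstruction objectives mentioned in the abstract would be applied on top of the returned embeddings.

import torch
import torch.nn as nn


class CausalVisualTokenizer(nn.Module):
    """Hypothetical sketch: 2D patch features -> 1D causal discrete tokens."""

    def __init__(self, patch_dim=1024, d_model=768, num_tokens=32, codebook_size=8192):
        super().__init__()
        # Learnable queries define a fixed-length 1D token sequence.
        self.queries = nn.Parameter(torch.randn(num_tokens, d_model))
        self.patch_proj = nn.Linear(patch_dim, d_model)
        # Queries attend causally to each other (masked self-attention) and
        # freely to the projected 2D patch features (cross-attention).
        layer = nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Discrete codebook for vector quantization.
        self.codebook = nn.Embedding(codebook_size, d_model)

    def forward(self, patch_feats):
        # patch_feats: (B, num_patches, patch_dim) from a frozen image encoder.
        B = patch_feats.size(0)
        memory = self.patch_proj(patch_feats)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        T = queries.size(1)
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=queries.device), diagonal=1
        )
        z = self.decoder(queries, memory, tgt_mask=causal_mask)  # 1D causal embeddings
        # Nearest-neighbour quantization (standard VQ distance computation).
        dist = (
            z.pow(2).sum(-1, keepdim=True)
            - 2 * z @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(-1)
        )
        ids = dist.argmin(dim=-1)            # (B, num_tokens) discrete visual tokens
        z_q = self.codebook(ids)
        z_q = z + (z_q - z).detach()         # straight-through estimator
        return ids, z_q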
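
The claim that an off-the-shelf LLM can perform both image-to-text and text-to-image generation through efficient LoRA tuning can likewise be sketched. The snippet below is a hedged illustration using Hugging Face transformers and peft, not the paper's released code: the discrete visual codes are registered as extra vocabulary entries so that captioning and image generation both become ordinary next-token prediction, and only the LoRA adapters (plus the grown embedding table) are trained. The base model name, codebook size, special-token format, and LoRA hyperparameters are all assumptions for illustration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

CODEBOOK_SIZE = 8192  # assumed size of the SEED codebook

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Register one token per discrete visual code, e.g. "<img_0>" ... "<img_8191>",
# plus boundary markers, and grow the embedding table accordingly.
visual_tokens = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)] + ["<img_start>", "<img_end>"]
tokenizer.add_tokens(visual_tokens)
model.resize_token_embeddings(len(tokenizer))

# Wrap the frozen LLM with low-rank adapters; only the adapters and the
# modules listed in modules_to_save are updated during tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training interleaves captions and visual codes, e.g.
#   "<img_start><img_512><img_7>...<img_end> A dog chasing a ball."   (image-to-text)
#   "A dog chasing a ball. <img_start><img_512>...<img_end>"          (text-to-image)
# and optimizes the standard left-to-right language-modeling loss.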