Planting a SEED of Vision in Large Language Model
July 16, 2023
Authors: Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
cs.AI
Abstract
We present SEED, an elaborate image tokenizer that empowers Large Language
Models (LLMs) with the emergent ability to SEE and Draw at the same time.
Research on image tokenizers has previously reached an impasse, as frameworks
employing quantized visual tokens have lost prominence due to subpar
performance and convergence in multimodal comprehension (compared to BLIP-2,
etc.) or generation (compared to Stable Diffusion, etc.). Despite these
limitations, we remain confident in the natural capacity of discrete visual
tokens to unify visual and textual representations, facilitating scalable
multimodal training with the LLM's original recipe. In this study, we identify
two crucial principles for the
architecture and training of SEED that effectively ease subsequent alignment
with LLMs. (1) Image tokens should be independent of 2D physical patch
positions and instead be produced with a 1D causal dependency, exhibiting
intrinsic interdependence that aligns with the left-to-right autoregressive
prediction mechanism in LLMs. (2) Image tokens should capture high-level
semantics consistent with the degree of semantic abstraction in words, and be
optimized for both discriminativeness and reconstruction during the tokenizer
training phase. As a result, the off-the-shelf LLM is able to perform both
image-to-text and text-to-image generation by incorporating our SEED through
efficient LoRA tuning. Comprehensive multimodal pretraining and instruction
tuning, which may yield improved results, are reserved for future
investigation. This version of SEED was trained in 5.7 days using only 64 V100
GPUs and 5M publicly available image-text pairs. Our preliminary study
emphasizes the great potential of discrete visual tokens in versatile
multimodal LLMs and the importance of proper image tokenizers in broader
research.
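
To make principles (1) and (2) above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: learnable queries attend to 2D patch features through a transformer decoder whose self-attention is causally masked, yielding a 1D left-to-right token sequence, and each resulting embedding is snapped to a discrete codebook entry with straight-through gradients. Module names, dimensions, and the codebook size are illustrative assumptions; the discriminative (e.g., image-text contrastive) and reconstruction objectives mentioned in the abstract would be applied on top of the returned embeddings.

import torch
import torch.nn as nn


class CausalVisualTokenizer(nn.Module):
    """Hypothetical sketch: 2D patch features -> 1D causal discrete tokens."""

    def __init__(self, patch_dim=1024, d_model=768, num_tokens=32, codebook_size=8192):
        super().__init__()
        # Learnable queries define a fixed-length 1D token sequence.
        self.queries = nn.Parameter(torch.randn(num_tokens, d_model))
        self.patch_proj = nn.Linear(patch_dim, d_model)
        # Queries attend causally to each other (masked self-attention) and
        # freely to the projected 2D patch features (cross-attention).
        layer = nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Discrete codebook for vector quantization.
        self.codebook = nn.Embedding(codebook_size, d_model)

    def forward(self, patch_feats):
        # patch_feats: (B, num_patches, patch_dim) from a frozen image encoder.
        B = patch_feats.size(0)
        memory = self.patch_proj(patch_feats)
        queries = self.queries.unsqueeze(0).expand(B, -1, -1)
        T = queries.size(1)
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=queries.device), diagonal=1
        )
        z = self.decoder(queries, memory, tgt_mask=causal_mask)  # 1D causal embeddings
        # Nearest-neighbour quantization (standard VQ distance computation).
        dist = (
            z.pow(2).sum(-1, keepdim=True)
            - 2 * z @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(-1)
        )
        ids = dist.argmin(dim=-1)            # (B, num_tokens) discrete visual tokens
        z_q = self.codebook(ids)
        z_q = z + (z_q - z).detach()         # straight-through estimator
        return ids, z_q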
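
The claim that an off-the-shelf LLM can perform both image-to-text and text-to-image generation through efficient LoRA tuning can likewise be sketched. The snippet below is a hedged illustration using Hugging Face transformers and peft, not the paper's released code: the discrete visual codes are registered as extra vocabulary entries so that captioning and image generation both become ordinary next-token prediction, and only the LoRA adapters (plus the grown embedding table) are trained. The base model name, codebook size, special-token format, and LoRA hyperparameters are all assumptions for illustration.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

CODEBOOK_SIZE = 8192  # assumed size of the SEED codebook

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")

# Register one token per discrete visual code, e.g. "<img_0>" ... "<img_8191>",
# plus boundary markers, and grow the embedding table accordingly.
visual_tokens = [f"<img_{i}>" for i in range(CODEBOOK_SIZE)] + ["<img_start>", "<img_end>"]
tokenizer.add_tokens(visual_tokens)
model.resize_token_embeddings(len(tokenizer))

# Wrap the frozen LLM with low-rank adapters; only the adapters and the
# modules listed in modules_to_save are updated during tuning.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training interleaves captions and visual codes, e.g.
#   "<img_start><img_512><img_7>...<img_end> A dog chasing a ball."   (image-to-text)
#   "A dog chasing a ball. <img_start><img_512>...<img_end>"          (text-to-image)
# and optimizes the standard left-to-right language-modeling loss.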