Planting a SEED of Vision in Large Language Model
July 16, 2023
Authors: Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
cs.AI
Abstract
We present SEED, an elaborate image tokenizer that empowers Large Language
Models (LLMs) with the emergent ability to SEE and Draw at the same time.
Research on image tokenizers has previously reached an impasse: frameworks
employing quantized visual tokens have lost prominence due to their subpar
performance and convergence relative to models specialized for multimodal
comprehension (e.g., BLIP-2) or generation (e.g., Stable Diffusion). Despite
these limitations, we remain confident in the natural capacity of discrete
visual tokens to unify visual and textual representations, facilitating
scalable multimodal training with the original recipe of LLMs. In this study,
we identify two crucial principles for the
architecture and training of SEED that effectively ease subsequent alignment
with LLMs. (1) Image tokens should be independent of 2D physical patch
positions and instead be produced with a 1D causal dependency, exhibiting
intrinsic interdependence that aligns with the left-to-right autoregressive
prediction mechanism in LLMs. (2) Image tokens should capture high-level
semantics consistent with the degree of semantic abstraction in words, and be
optimized for both discriminativeness and reconstruction during the tokenizer
training phase. As a result, an off-the-shelf LLM is able to perform both
image-to-text and text-to-image generation by incorporating our SEED through
efficient LoRA tuning. Comprehensive multimodal pretraining and instruction
tuning, which may yield improved results, are reserved for future
investigation. This version of SEED was trained in 5.7 days using only 64 V100
GPUs and 5M publicly available image-text pairs. Our preliminary study
emphasizes the great potential of discrete visual tokens in versatile
multimodal LLMs and the importance of proper image tokenizers in broader
research.
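
To make the two principles more concrete, below is a minimal PyTorch sketch (our illustration, not the authors' released implementation) of a causal image tokenizer: a set of learnable queries attends to position-agnostic ViT patch features through causally masked self-attention, and the resulting 1D sequence is vector-quantized into discrete visual token ids that an LLM could consume left to right. The module name, token count, and codebook size are assumptions chosen for illustration.

# Illustrative sketch of the two SEED tokenizer principles (not the official code):
# (1) produce image tokens with a 1D causal dependency, decoupled from 2D patch
#     positions; (2) quantize high-level features into discrete token ids.
import torch
import torch.nn as nn

class CausalImageTokenizer(nn.Module):  # hypothetical name
    def __init__(self, feat_dim=768, num_tokens=32, codebook_size=8192):
        super().__init__()
        # Learnable queries that become the 1D visual token sequence.
        self.queries = nn.Parameter(torch.randn(num_tokens, feat_dim))
        # Decoder block: causal self-attention over the queries, cross-attention
        # to image features (treated as an unordered set of patch embeddings).
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.causal_former = nn.TransformerDecoder(layer, num_layers=4)
        # Codebook used to quantize continuous features into discrete ids.
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, patch_feats):
        # patch_feats: (B, num_patches, feat_dim) from a frozen ViT encoder.
        B, T = patch_feats.size(0), self.queries.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.causal_former(tgt=q, memory=patch_feats, tgt_mask=causal_mask)
        # Nearest-codebook lookup -> discrete visual token ids (left to right).
        dists = torch.cdist(h, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        token_ids = dists.argmin(dim=-1)        # (B, num_tokens)
        quantized = self.codebook(token_ids)    # embeddings for the reconstruction path
        return token_ids, quantized

tokenizer = CausalImageTokenizer()
ids, _ = tokenizer(torch.randn(2, 196, 768))    # dummy ViT patch features
print(ids.shape)                                # torch.Size([2, 32])

In such a setup the discrete ids could be interleaved with text tokens and predicted autoregressively by the LLM, while the quantized embeddings feed a reconstruction decoder, matching the discriminativeness-plus-reconstruction objective described above.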