Planting a SEED of Vision in Large Language Model

Abstract

We present SEED, an elaborate image tokenizer that gives Large Language Models (LLMs) the emergent ability to SEE and Draw at the same time. Research on image tokenizers had previously reached an impasse: frameworks built on quantized visual tokens lost prominence because of subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite these limitations, we remain confident in the natural capacity of quantized visual tokens to unify visual and textual representations, which facilitates scalable multimodal training with the original LLM recipe. In this study, we identify two crucial principles for the architecture and training of SEED that ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism of LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and should be optimized for both discriminativeness and reconstruction during tokenizer training. As a result, an off-the-shelf LLM can perform both image-to-text and text-to-image generation by incorporating SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
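
To make principle (1) concrete, the following is a minimal sketch of the two ideas it combines: mapping continuous features to discrete token IDs, and restricting each token to depend only on earlier positions in a 1D sequence. It is not the SEED implementation. The nearest-neighbor codebook lookup is a common quantization technique assumed here for illustration, and the function names `quantize` and `causal_mask`, the toy codebook, and the toy features are all hypothetical.

```python
from typing import List, Sequence


def causal_mask(num_tokens: int) -> List[List[bool]]:
    """Return a mask where mask[i][j] is True if token i may attend to token j.

    Only positions j <= i are visible, which mirrors the left-to-right
    autoregressive dependency described in principle (1).
    """
    return [[j <= i for j in range(num_tokens)] for i in range(num_tokens)]


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: squared Euclidean distance and a fixed codebook; the
    abstract does not specify how SEED performs quantization.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


# Toy values, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.9, 1.1], [0.1, -0.1], [0.2, 0.8]]

token_ids = quantize(features, codebook)  # discrete visual token IDs
mask = causal_mask(len(token_ids))        # lower-triangular visibility

print(token_ids)
for row in mask:
    print(row)
```

The resulting token IDs form a plain 1D sequence, so in principle they can be appended to a text token sequence and predicted with the same next-token objective an LLM already uses.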
