End-to-End Vision Tokenizer Tuning

Abstract

Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming that the visual tokens generalize well across tasks such as image generation and visual question answering. A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks that require varied representations and semantics. This decoupled paradigm introduces a critical misalignment: the information loss in vision tokenization can become a representation bottleneck for target tasks. For example, errors in tokenizing text in an image lead to poor results when recognizing or generating that text. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization of vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and caption objectives. ETT is simple to implement and integrates seamlessly into existing training pipelines with minimal architecture modifications, without the need to adjust the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that the proposed end-to-end vision tokenizer tuning yields significant performance gains of 2-6% on multimodal understanding and visual generation tasks compared to frozen tokenizer baselines, while preserving the original reconstruction capability. We hope this simple and strong method can empower multimodal foundation models beyond image generation and understanding.
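
To make the data path described above concrete, the following is a minimal illustrative sketch, not the authors' implementation. It shows a nearest-neighbor codebook lookup that returns both the discrete index (what a frozen-tokenizer pipeline would pass on) and the codebook embedding (what ETT passes on), followed by a weighted sum of the caption and reconstruction objectives. The names nearest_code, project, and joint_loss, the linear projector, the weight alpha, and all numeric values are assumptions made for illustration; gradient flow, which is what actually tunes the tokenizer, is omitted.

```python
def nearest_code(feature, codebook):
    """Return (index, embedding) of the codebook entry closest to `feature`."""
    dists = [sum((f - c) ** 2 for f, c in zip(feature, code)) for code in codebook]
    idx = min(range(len(codebook)), key=dists.__getitem__)
    return idx, codebook[idx]


def project(embedding, weight):
    """Hypothetical linear projector from codebook space to the LLM input space."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in weight]


def joint_loss(caption_loss, recon_loss, alpha=0.25):
    """Weighted sum of the two objectives; alpha is an assumed weight."""
    return caption_loss + alpha * recon_loss


# Toy values, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
projector = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps a 2-d code to a 3-d input
feature = [0.9, 0.2]  # stand-in for one encoder output vector

# A frozen-tokenizer pipeline would forward only `idx`;
# the ETT-style path forwards the codebook embedding itself.
idx, code_vec = nearest_code(feature, codebook)
llm_input = project(code_vec, projector)

# Placeholder loss values standing in for the caption and reconstruction terms.
total = joint_loss(caption_loss=2.0, recon_loss=0.4)
```

In a real training setup the lookup, projector, and both losses would live in an automatic-differentiation framework so that the caption loss can update the codebook and encoder; the sketch only fixes which quantity is handed to the language model and how the two objectives are combined.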