End-to-End Vision Tokenizer Tuning
May 15, 2025
Authors: Wenxuan Wang, Fan Zhang, Yufeng Cui, Haiwen Diao, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang
cs.AI
Abstract
Existing vision tokenization isolates the optimization of vision tokenizers
from downstream training, implicitly assuming the visual tokens can generalize
well across various tasks, e.g., image generation and visual question
answering. However, a vision tokenizer optimized for low-level reconstruction is
agnostic to downstream tasks that require varied representations and semantics.
This decoupled paradigm introduces a critical misalignment: the information lost
during vision tokenization can become the representation bottleneck for target
tasks. For example, errors in tokenizing the text in a given image lead to poor
results when recognizing or generating that text. To address this, we propose
ETT, an end-to-end
vision tokenizer tuning approach that enables joint optimization between vision
tokenization and target autoregressive tasks. Unlike prior autoregressive
models that use only discrete indices from a frozen vision tokenizer, ETT
leverages the visual embeddings of the tokenizer codebook and optimizes the
vision tokenizer end-to-end with both reconstruction and caption objectives.
ETT integrates seamlessly into existing training pipelines with minimal
architectural modifications: it is simple to implement and requires no changes
to the original codebooks or architectures of the employed large language
models. Extensive experiments demonstrate that our
proposed end-to-end vision tokenizer tuning unlocks significant performance
gains, i.e., 2-6% for multimodal understanding and visual generation tasks
compared to frozen tokenizer baselines, while preserving the original
reconstruction capability. We hope this simple yet strong method can empower
multimodal foundation models beyond image generation and understanding.
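
For intuition, the sketch below shows what one joint training step could look like in PyTorch. This is not the authors' implementation: the module interfaces (tokenizer, projector, llm), the MSE reconstruction loss, and the loss weights are illustrative assumptions; the abstract only specifies that the LLM consumes the tokenizer's codebook embeddings and that caption and reconstruction objectives are optimized jointly.

```python
import torch.nn.functional as F

def ett_training_step(images, caption_ids, tokenizer, projector, llm, optimizer,
                      caption_weight=1.0, recon_weight=1.0):
    """One joint update in the spirit of ETT. Assumed (hypothetical) interfaces:
      tokenizer.encode(images)        -> codebook embeddings (B, N, D_tok),
                                         differentiable (e.g., straight-through
                                         estimation at the quantization step)
      tokenizer.decode(embeddings)    -> reconstructed images (B, C, H, W)
      projector(embeddings)           -> LLM-space embeddings (B, N, D_llm)
      llm(vision_embeds, caption_ids) -> caption logits (B, T, V)
    """
    # Feed continuous codebook embeddings, not discrete indices, to the LLM,
    # so gradients from the caption loss can flow back into the tokenizer.
    vis_embeds = tokenizer.encode(images)
    caption_logits = llm(projector(vis_embeds), caption_ids)

    # Caption objective: cross-entropy over the caption tokens.
    vocab = caption_logits.size(-1)
    caption_loss = F.cross_entropy(caption_logits.reshape(-1, vocab),
                                   caption_ids.reshape(-1))

    # Reconstruction objective: preserves the tokenizer's original
    # pixel-level ability while it adapts to the autoregressive task.
    recon_loss = F.mse_loss(tokenizer.decode(vis_embeds), images)

    loss = caption_weight * caption_loss + recon_weight * recon_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

The key point of the sketch is the gradient path: because the LLM consumes the codebook embeddings themselves, the caption loss back-propagates into the vision tokenizer, while the reconstruction term guards against degrading its original pixel-level fidelity.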