End-to-End Vision Tokenizer Tuning

May 15, 2025
作者: Wenxuan Wang, Fan Zhang, Yufeng Cui, Haiwen Diao, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang
cs.AI

Abstract

Existing vision tokenization isolates the optimization of vision tokenizers from downstream training, implicitly assuming the visual tokens can generalize well across various tasks, e.g., image generation and visual question answering. A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks that require varied representations and semantics. This decoupled paradigm introduces a critical misalignment: the loss incurred by vision tokenization can become a representation bottleneck for target tasks. For example, errors in tokenizing the text in a given image lead to poor results when recognizing or generating that text. To address this, we propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks. Unlike prior autoregressive models that use only discrete indices from a frozen vision tokenizer, ETT leverages the visual embeddings of the tokenizer codebook and optimizes the vision tokenizer end-to-end with both reconstruction and captioning objectives. ETT can be seamlessly integrated into existing training pipelines with minimal architecture modifications: it is simple to implement and integrate, and requires no adjustment to the original codebooks or architectures of the employed large language models. Extensive experiments demonstrate that our end-to-end vision tokenizer tuning unlocks significant performance gains, i.e., 2-6% on multimodal understanding and visual generation tasks compared to frozen-tokenizer baselines, while preserving the original reconstruction capability. We hope this simple yet strong method can empower multimodal foundation models beyond image generation and understanding.
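
The abstract names the core mechanism but gives no implementation details. Below is a minimal PyTorch sketch of the idea, assuming a VQ-style tokenizer and a toy captioning LM as stand-ins for the paper's actual tokenizer and large language model; every name (ToyVQTokenizer, ToyCaptioner, ett_style_step), dimension, and loss weight is an illustrative assumption, not ETT's implementation. The point it illustrates is the one the abstract makes: the language model consumes continuous codebook embeddings rather than discrete indices, so the captioning loss can backpropagate into the tokenizer jointly with the reconstruction loss.

```python
# Illustrative sketch only: a toy VQ tokenizer and captioner trained jointly,
# so the caption loss shapes the tokenizer. Not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    """Stand-in VQ tokenizer: conv encoder -> codebook lookup -> conv decoder."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose2d(dim, 3, kernel_size=8, stride=8)

    def forward(self, images):
        z = self.encoder(images)                     # (B, D, H, W)
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)  # (B*H*W, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        quant = self.codebook(idx)                   # continuous codebook embeddings
        # Standard VQ terms keep the encoder and codebook aligned.
        vq_loss = (F.mse_loss(quant, flat.detach())
                   + 0.25 * F.mse_loss(flat, quant.detach()))
        # Straight-through estimator: reconstruction gradients reach the encoder.
        quant_st = flat + (quant - flat).detach()
        recon = self.decoder(quant_st.view(b, h, w, d).permute(0, 3, 1, 2))
        # Return the embeddings themselves (not discrete indices), so a
        # downstream loss can update the codebook end to end.
        return quant.view(b, h * w, d), recon, vq_loss

class ToyCaptioner(nn.Module):
    """Stand-in autoregressive LM conditioned on a visual-embedding prefix."""
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, vis_emb, caption_ids):
        seq = torch.cat([vis_emb, self.tok_emb(caption_ids)], dim=1)
        h, _ = self.rnn(seq)
        v = vis_emb.size(1)
        logits = self.head(h[:, v - 1:-1])  # next-token predictions for the caption
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption_ids.reshape(-1))

def ett_style_step(tokenizer, captioner, images, caption_ids,
                   w_recon=1.0, w_cap=1.0):
    """Joint objective: reconstruction and caption losses both backpropagate
    into the tokenizer, instead of training it in isolation."""
    vis_emb, recon, vq_loss = tokenizer(images)
    return (w_recon * (F.mse_loss(recon, images) + vq_loss)
            + w_cap * captioner(vis_emb, caption_ids))

# Toy usage with random data.
tokenizer, captioner = ToyVQTokenizer(), ToyCaptioner()
opt = torch.optim.AdamW(list(tokenizer.parameters())
                        + list(captioner.parameters()), lr=1e-4)
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 32000, (2, 12))
opt.zero_grad()
loss = ett_style_step(tokenizer, captioner, images, captions)
loss.backward()
opt.step()
```

Because the captioner receives the codebook lookup output `quant` rather than the index tensor `idx`, the caption loss has a differentiable path into the codebook weights; with a frozen-tokenizer baseline that path does not exist.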
