

Unveiling Encoder-Free Vision-Language Models

June 17, 2024
Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
cs.AI

Abstract

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for vision-language tasks. However, the vision encoders impose strong inductive biases in abstracting visual representations, e.g., resolution, aspect ratio, and semantic priors, which can impede the flexibility and efficiency of VLMs. Training pure VLMs that accept seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models and present a simple yet effective training recipe for pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) bridging vision-language representation inside one unified decoder; (2) enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and run efficiently. Notably, using only 35M publicly accessible data samples, EVE impressively rivals encoder-based VLMs of similar capacity across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B, which relies on undisclosed training procedures and training data. We believe that EVE provides a transparent and efficient route toward developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
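
The core architectural idea, a unified decoder that consumes raw image patches alongside text tokens without a pretrained vision encoder, can be illustrated with a minimal sketch. The snippet below is not EVE's actual implementation: the class name, layer sizes, patch size, and the use of stock PyTorch transformer blocks are illustrative assumptions only, and positional embeddings are omitted for brevity.

```python
# Hypothetical sketch of an encoder-free VLM: a lightweight patch-embedding layer
# replaces the pretrained vision encoder, and visual and text tokens share one
# causal decoder. All hyperparameters below are placeholders, not EVE's.
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, patch=16, layers=4, heads=8):
        super().__init__()
        # Lightweight patch embedding instead of a full vision encoder (e.g., CLIP-ViT).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, layers)  # causal mask applied in forward
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, image, text_ids):
        # image: (B, 3, H, W); text_ids: (B, T)
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, dim) patch tokens
        txt = self.tok_embed(text_ids)                            # (B, T, dim) text tokens
        seq = torch.cat([vis, txt], dim=1)                        # one unified sequence
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        hidden = self.decoder(seq, mask=mask)
        return self.lm_head(hidden[:, vis.size(1):])              # logits for text positions

# Usage: next-token logits for a single (image, text) pair.
model = EncoderFreeVLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 32000])
```

The paper's second key ingredient, extra supervision to strengthen visual recognition, would sit on top of such a backbone as an additional training loss and is not shown here.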

