Unveiling Encoder-Free Vision-Language Models
June 17, 2024
Authors: Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang
cs.AI
Abstract
Existing vision-language models (VLMs) mostly rely on vision encoders to
extract visual features, which are then passed to large language models (LLMs)
for vision-language tasks. However, vision encoders impose strong inductive
biases when abstracting visual representations, e.g., resolution, aspect
ratio, and semantic priors, which can limit the flexibility and efficiency of
VLMs. Training pure VLMs that accept seamless vision and language inputs,
i.e., without vision encoders, remains challenging and rarely explored.
Empirical observations reveal that training directly without an encoder leads
to slow convergence and large performance gaps. In this work, we bridge the
gap between encoder-based and encoder-free models and present a simple yet
effective training recipe for pure VLMs. Specifically, through thorough
experiments we unveil the key aspects of training encoder-free VLMs
efficiently: (1) bridging vision-language representations inside one unified
decoder; (2) enhancing visual recognition capability via extra supervision.
With these strategies, we launch EVE, an encoder-free vision-language model
that can be trained and run efficiently. Notably, using only 35M publicly
accessible data samples, EVE impressively rivals encoder-based VLMs of
similar capacity across multiple vision-language benchmarks. It significantly
outperforms its counterpart Fuyu-8B, whose training procedures and training
data remain undisclosed. We believe that EVE offers a transparent and
efficient route toward a pure decoder-only architecture across modalities.
Our code and models are publicly available at:
https://github.com/baaivision/EVE.
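To make the two strategies above concrete, the following is a minimal PyTorch
sketch of an encoder-free VLM: raw pixels enter through a single patch
projection and share one causal decoder with the text tokens, and an auxiliary
alignment loss stands in for the "extra supervision". This is an illustration
under stated assumptions, not EVE's actual implementation; all module names,
dimensions, and the specific form of the auxiliary loss are hypothetical.

```python
# Illustrative sketch of an encoder-free VLM (hypothetical names/sizes,
# not the EVE codebase; see https://github.com/baaivision/EVE for the real one).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderFreeVLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=8, patch=14):
        super().__init__()
        # (1) No pretrained vision encoder: pixels are embedded by a single
        # linear patch projection and fed into the same decoder as text.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Self-attention-only layers plus a causal mask emulate a
        # decoder-only LLM stack.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, image, text_ids):
        v = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, d)
        t = self.tok_embed(text_ids)                            # (B, T, d)
        x = torch.cat([v, t], dim=1)  # vision tokens prefix the text sequence
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        n_vis = v.size(1)
        # Return text logits and the decoder's vision-token hidden states.
        return self.lm_head(h[:, n_vis:]), h[:, :n_vis]

def training_loss(model, image, text_ids, labels, frozen_vision_feats, alpha=1.0):
    """(2) 'Extra supervision' sketched as an alignment term that pulls the
    decoder's vision-token states toward features from a frozen vision model
    (one plausible reading of the abstract, not a confirmed recipe). Assumes
    frozen_vision_feats is already projected to shape (B, N, d_model);
    label shifting is omitted for brevity."""
    logits, vis_states = model(image, text_ids)
    lm = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    align = 1 - F.cosine_similarity(vis_states, frozen_vision_feats, dim=-1).mean()
    return lm + alpha * align
```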