
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

February 10, 2025
Authors: Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
cs.AI

Abstract

Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential of unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs built on pre-trained vision encoders, discrete tokenizers, and minimalist visual layers trained from scratch, and examine in depth the under-explored characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities; (ii) a well-designed training strategy enables effective optimization of encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study of developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
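
Point (i) describes a divide-and-conquer decoder design: visual and textual tokens flow through one unified transformer, but modality-specific parameters keep their statistics from interfering. Below is a minimal PyTorch sketch of that idea, assuming per-modality LayerNorms and feed-forward paths around a shared causal-attention pass; the class name `ModalityAwareLayer`, the `is_visual` mask, and all dimensions are illustrative assumptions, not the authors' implementation (the paper's full design reportedly decomposes the attention matrices per modality as well, which this sketch omits for brevity).

```python
import torch
import torch.nn as nn

class ModalityAwareLayer(nn.Module):
    """Illustrative decoder layer with modality-specific norms and FFNs.

    Visual and textual tokens share a single causal-attention pass (the
    model stays unified), but are normalized and transformed by separate
    parameters to reduce cross-modal interference.
    """

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)  # norm for vision tokens
        self.norm_t = nn.LayerNorm(dim)  # norm for text tokens
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_v = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ffn_t = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, is_visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); is_visual: (batch, seq) bool, True = image token.
        m = is_visual.unsqueeze(-1)  # (batch, seq, 1), broadcast over dim
        # Route each token through its modality's normalization.
        h = torch.where(m, self.norm_v(x), self.norm_t(x))
        # One shared causal attention over the mixed sequence; True = blocked.
        L = x.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal)
        x = x + attn_out
        # Route each token through its modality's feed-forward path.
        return x + torch.where(m, self.ffn_v(x), self.ffn_t(x))

# Usage: a mixed sequence of 64 image tokens followed by 16 text tokens.
layer = ModalityAwareLayer()
x = torch.randn(2, 80, 512)
is_visual = torch.zeros(2, 80, dtype=torch.bool)
is_visual[:, :64] = True
out = layer(x, is_visual)  # (2, 80, 512)
```

Sharing the attention pass keeps the sequence unified, while the per-modality parameters absorb the distribution gap between patch embeddings trained from scratch and mature language representations, which is one plausible reading of how interference between modalities is reduced.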
