OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
May 7, 2025
作者: Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, Cihang Xie
cs.AI
Abstract
OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing work -- e.g., CLIPS for the training framework and Recap-DataComp-1B for the training data -- while revealing multiple key insights for enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
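The abstract describes dropping these encoders into a LLaVA-style multimodal pipeline as the vision tower. Below is a minimal sketch of that usage pattern, not the authors' actual integration code; the checkpoint identifier is a placeholder (an OpenAI CLIP model, not an OpenVision release name), and it assumes the encoder ships in a Hugging Face CLIPVisionModel-compatible format.

```python
# Minimal sketch: a CLIP-style vision encoder used as the "vision tower"
# of a LLaVA-like pipeline. The checkpoint name is a placeholder -- swap in
# whichever OpenVision (or other CLIP-compatible) encoder you actually use.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

CHECKPOINT = "openai/clip-vit-large-patch14"  # placeholder checkpoint id

processor = CLIPImageProcessor.from_pretrained(CHECKPOINT)
encoder = CLIPVisionModel.from_pretrained(CHECKPOINT).eval()

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = encoder(pixel_values)

# LLaVA-style usage: keep the per-patch token features (dropping the CLS token),
# which a projector would then map into the language model's embedding space.
patch_features = outputs.last_hidden_state[:, 1:, :]
print(patch_features.shape)  # (1, num_patches, hidden_dim)
```

The capacity/efficiency trade-off mentioned in the abstract amounts to choosing a smaller or larger checkpoint here: the same feature-extraction interface applies across the released 5.9M to 632.1M parameter range.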