OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
May 7, 2025
Authors: Xianhang Li, Yanqing Liu, Haoqin Tu, Hongru Zhu, Cihang Xie
cs.AI
Abstract
OpenAI's CLIP, released in early 2021, has long been the go-to choice of vision encoder for building multimodal foundation models. Although recent alternatives such as SigLIP have begun to challenge this status quo, to our knowledge none are fully open: their training data remains proprietary and/or their training recipes are not released. This paper fills this gap with OpenVision, a fully-open, cost-effective family of vision encoders that match or surpass the performance of OpenAI's CLIP when integrated into multimodal frameworks like LLaVA. OpenVision builds on existing work -- e.g., CLIPS as its training framework and Recap-DataComp-1B as its training data -- while revealing multiple key insights for enhancing encoder quality and showcasing practical benefits in advancing multimodal models. By releasing vision encoders spanning from 5.9M to 632.1M parameters, OpenVision offers practitioners a flexible trade-off between capacity and efficiency in building multimodal models: larger models deliver enhanced multimodal performance, while smaller versions enable lightweight, edge-ready multimodal deployments.
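As a minimal usage sketch (not taken from the paper): since the abstract describes the released encoders as CLIP-style models, one plausible way to use one as a vision tower is to load it through the open_clip library and extract pooled image embeddings. The Hugging Face hub identifier below is a hypothetical placeholder and may not match the actual released checkpoint names.

```python
# Minimal sketch: loading an OpenVision encoder as a CLIP-style vision tower.
# Assumption: the checkpoints are published in an open_clip-compatible format;
# the hub id "UCSC-VLAA/openvision-vit-large-patch14-224" is illustrative only.
import torch
import open_clip
from PIL import Image

model, preprocess = open_clip.create_model_from_pretrained(
    "hf-hub:UCSC-VLAA/openvision-vit-large-patch14-224"  # hypothetical identifier
)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # shape [1, 3, H, W]
with torch.no_grad():
    image_features = model.encode_image(image)              # pooled visual embedding
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

print(image_features.shape)  # e.g., [1, embedding_dim]
```

In an LLaVA-style setup, such an encoder would typically replace the default CLIP vision tower, with its (normalized or pre-pooling) features fed through a projector into the language model; the exact integration depends on the framework's configuration.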