Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
June 24, 2024
Authors: Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie
cs.AI
Abstract
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a
vision-centric approach. While stronger language models can enhance multimodal
capabilities, the design choices for vision components are often insufficiently
explored and disconnected from visual representation learning research. This
gap hinders accurate sensory grounding in real-world scenarios. Our study uses
LLMs and visual instruction tuning as an interface to evaluate various visual
representations, offering new insights into different models and architectures
-- self-supervised, strongly supervised, or combinations thereof -- based on
experiments with over 20 vision encoders. We critically examine existing MLLM
benchmarks, addressing the difficulties involved in consolidating and
interpreting results from various tasks, and introduce a new vision-centric
benchmark, CV-Bench. To further improve visual grounding, we propose the
Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that
integrates high-resolution vision features with LLMs while reducing the number
of tokens. Additionally, we discuss the curation of high-quality visual
instruction-tuning data from publicly available sources, emphasizing the
importance of data source balancing and distribution ratio. Collectively,
Cambrian-1 not only achieves state-of-the-art performance but also serves as a
comprehensive, open cookbook for instruction-tuned MLLMs. We provide model
weights, code, supporting tools, datasets, and detailed instruction-tuning and
evaluation recipes. We hope our release will inspire and accelerate
advancements in multimodal systems and visual representation learning.
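To make the connector idea concrete, the sketch below shows one plausible reading of such an aggregator: a fixed set of learnable query tokens cross-attends over projected feature maps from several vision encoders, so the LLM receives a constant number of tokens regardless of input resolution. This is a minimal, hypothetical illustration, not the released SVA: all names (`SpatialVisionAggregator`, `num_queries`, the encoder dimensions) are invented for the example, and it uses global cross-attention for brevity where a spatially-aware connector would restrict each query to its corresponding region of each feature map.

```python
# Hypothetical sketch of a spatially-aware vision aggregator.
# Names and design details are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class SpatialVisionAggregator(nn.Module):
    """Cross-attends a fixed set of learnable queries over feature maps from
    several vision encoders, yielding far fewer tokens than the concatenated
    high-resolution features."""

    def __init__(self, llm_dim: int, encoder_dims: list[int], num_queries: int = 576):
        super().__init__()
        # One learnable query per output token; num_queries bounds the token
        # count handed to the LLM, independent of input resolution.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project each encoder's features into the LLM embedding space.
        self.projections = nn.ModuleList(nn.Linear(d, llm_dim) for d in encoder_dims)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, feature_maps: list[torch.Tensor]) -> torch.Tensor:
        # feature_maps[i]: (batch, tokens_i, encoder_dims[i]); token counts may
        # differ across encoders (different patch sizes / input resolutions).
        projected = [proj(f) for proj, f in zip(self.projections, feature_maps)]
        keys = torch.cat(projected, dim=1)              # (batch, sum of tokens, llm_dim)
        q = self.queries.unsqueeze(0).expand(keys.size(0), -1, -1)
        out, _ = self.attn(q, keys, keys)               # (batch, num_queries, llm_dim)
        return out

# Example: fuse a 24x24 CLIP-style map and a 64x64 DINO-style map into 576 tokens.
if __name__ == "__main__":
    sva = SpatialVisionAggregator(llm_dim=4096, encoder_dims=[1024, 768])
    clip_feats = torch.randn(2, 24 * 24, 1024)
    dino_feats = torch.randn(2, 64 * 64, 768)
    print(sva([clip_feats, dino_feats]).shape)  # torch.Size([2, 576, 4096])
```

The token-reduction benefit the abstract mentions is visible in the example: two encoders contributing 576 and 4,096 patch tokens are compressed into a fixed 576 tokens for the LLM.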