Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

June 24, 2024
作者: Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie
cs.AI

Abstract

We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
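The abstract describes the Spatial Vision Aggregator (SVA) only at a high level. To make the idea concrete, below is a minimal PyTorch sketch of a spatially-aware connector in that spirit: a grid of learnable queries, each cross-attending only to the local window of high-resolution vision features it spatially covers, so the token count handed to the LLM drops from H×W to G×G. This is an illustrative assumption, not the paper's implementation; the class name `SpatialAggregatorSketch`, the single-encoder simplification, and all hyperparameters are hypothetical (per the paper, the actual SVA additionally aggregates features from multiple vision encoders and is applied at multiple LLM layers).

```python
# Hypothetical sketch of a spatially-aware vision connector, NOT the
# paper's SVA implementation: names, shapes, and the single-encoder
# simplification are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatialAggregatorSketch(nn.Module):
    """A G x G grid of learnable queries; each query cross-attends only
    to its local W x W window of high-resolution vision features,
    reducing H*W feature tokens to G*G tokens for the LLM."""

    def __init__(self, dim: int = 1024, grid: int = 24, feat_hw: int = 96):
        super().__init__()
        assert feat_hw % grid == 0, "feature map must tile evenly into the query grid"
        self.grid = grid
        self.win = feat_hw // grid  # side length of each local window
        self.queries = nn.Parameter(torch.randn(grid * grid, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, dim) flattened high-resolution vision features
        B, N, D = feats.shape
        G, W = self.grid, self.win
        assert N == (G * W) ** 2, "unexpected feature map size"
        # Regroup features so each query sees only its own W x W window.
        x = feats.view(B, G, W, G, W, D)                 # (B, gh, r, gw, c, D)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B * G * G, W * W, D)
        q = self.queries.unsqueeze(0).expand(B, -1, -1).reshape(B * G * G, 1, D)
        out, _ = self.attn(q, x, x)                      # local cross-attention
        return out.reshape(B, G * G, D)                  # (B, G*G, dim)
```

Under these assumed sizes, `SpatialAggregatorSketch()(torch.randn(2, 96 * 96, 1024))` maps 9,216 vision feature tokens per image to 576 output tokens of shape `(2, 576, 1024)`, a 16× reduction, while the per-window attention keeps each output token grounded in a fixed spatial region.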
