Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
September 25, 2024
Authors: Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
cs.AI
Abstract
Today's most advanced multimodal models remain proprietary. The strongest
open-weight models rely heavily on synthetic data from proprietary VLMs to
achieve good performance, effectively distilling these closed models into open
ones. As a result, the community is still missing foundational knowledge about
how to build performant VLMs from scratch. We present Molmo, a new family of
VLMs that are state-of-the-art in their class of openness. Our key innovation
is a novel, highly detailed image caption dataset collected entirely from human
annotators using speech-based descriptions. To enable a wide array of user
interactions, we also introduce a diverse dataset mixture for fine-tuning that
includes in-the-wild Q&A and innovative 2D pointing data. The success of our
approach relies on careful choices for the model architecture details, a
well-tuned training pipeline, and, most critically, the quality of our newly
collected datasets, all of which will be released. The best-in-class 72B model
within the Molmo family not only outperforms others in the class of open weight
and data models but also compares favorably against proprietary systems like
GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human
evaluation.
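To make the "2D pointing data" mention concrete, here is a minimal sketch of what a single pointing-style fine-tuning record could look like. This is purely illustrative: the field names, normalized-coordinate convention, and text serialization are assumptions for exposition, not the schema of the released PixMo dataset.

```python
# Hypothetical sketch of a 2D pointing example for VLM fine-tuning.
# Field names and the (x, y) text serialization are illustrative
# assumptions, not the released Molmo/PixMo data format.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PointingExample:
    image_path: str                    # path to the annotated image
    question: str                      # e.g. "Point to every mug."
    points: List[Tuple[float, float]]  # (x, y) in normalized [0, 1] coords

    def to_target_text(self) -> str:
        # Serialize the points as plain text so an autoregressive VLM
        # can emit them as ordinary output tokens.
        coords = " ".join(f"({x:.3f}, {y:.3f})" for x, y in self.points)
        return f"points: {coords}"


ex = PointingExample(
    image_path="kitchen.jpg",
    question="Point to every mug.",
    points=[(0.21, 0.55), (0.68, 0.40)],
)
print(ex.to_target_text())  # → points: (0.210, 0.550) (0.680, 0.400)
```

Representing points as text in the output stream is one simple way such supervision can be folded into a standard language-modeling objective, which is why pointing data composes cleanly with the rest of the fine-tuning mixture.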
We will be releasing all of our model weights, captioning and fine-tuning
data, and source code in the near future. Select model weights, inference code,
and demo are available at https://molmo.allenai.org.