Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
September 25, 2024
Authors: Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
cs.AI
Abstract
Today's most advanced multimodal models remain proprietary. The strongest
open-weight models rely heavily on synthetic data from proprietary VLMs to
achieve good performance, effectively distilling these closed models into open
ones. As a result, the community is still missing foundational knowledge about
how to build performant VLMs from scratch. We present Molmo, a new family of
VLMs that are state-of-the-art in their class of openness. Our key innovation
is a novel, highly detailed image caption dataset collected entirely from human
annotators using speech-based descriptions. To enable a wide array of user
interactions, we also introduce a diverse dataset mixture for fine-tuning that
includes in-the-wild Q&A and innovative 2D pointing data. The success of our
approach relies on careful choices for the model architecture details, a
well-tuned training pipeline, and, most critically, the quality of our newly
collected datasets, all of which will be released. The best-in-class 72B model
within the Molmo family not only outperforms others in the class of open weight
and data models but also compares favorably against proprietary systems like
GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human
evaluation.
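To make the "2D pointing data" mention concrete, here is a minimal sketch of what a single pointing-style fine-tuning record could look like. This is purely illustrative: the field names, normalized-coordinate convention, and text serialization are assumptions for exposition, not the schema of the released PixMo dataset.

```python
# Hypothetical sketch of a 2D pointing example for VLM fine-tuning.
# Field names and the (x, y) text serialization are illustrative
# assumptions, not the released Molmo/PixMo data format.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PointingExample:
    image_path: str                    # path to the annotated image
    question: str                      # e.g. "Point to every mug."
    points: List[Tuple[float, float]]  # (x, y) in normalized [0, 1] coords

    def to_target_text(self) -> str:
        # Serialize the points as plain text so an autoregressive VLM
        # can emit them as ordinary output tokens.
        coords = " ".join(f"({x:.3f}, {y:.3f})" for x, y in self.points)
        return f"points: {coords}"


ex = PointingExample(
    image_path="kitchen.jpg",
    question="Point to every mug.",
    points=[(0.21, 0.55), (0.68, 0.40)],
)
print(ex.to_target_text())  # → points: (0.210, 0.550) (0.680, 0.400)
```

Representing points as text in the output stream is one simple way such supervision can be folded into a standard language-modeling objective, which is why pointing data composes cleanly with the rest of the fine-tuning mixture.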
We will be releasing all of our model weights, captioning and fine-tuning
data, and source code in the near future. Select model weights, inference code,
and demo are available at https://molmo.allenai.org.