

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

September 25, 2024
作者: Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Kuo-Hao Zeng, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah A. Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, Aniruddha Kembhavi
cs.AI

Abstract

Today's most advanced multimodal models remain proprietary. The strongest open-weight models rely heavily on synthetic data from proprietary VLMs to achieve good performance, effectively distilling these closed models into open ones. As a result, the community is still missing foundational knowledge about how to build performant VLMs from scratch. We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. Our key innovation is a novel, highly detailed image caption dataset collected entirely from human annotators using speech-based descriptions. To enable a wide array of user interactions, we also introduce a diverse dataset mixture for fine-tuning that includes in-the-wild Q&A and innovative 2D pointing data. The success of our approach relies on careful choices for the model architecture details, a well-tuned training pipeline, and, most critically, the quality of our newly collected datasets, all of which will be released. The best-in-class 72B model within the Molmo family not only outperforms others in the class of open weight and data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation. We will be releasing all of our model weights, captioning and fine-tuning data, and source code in the near future. Select model weights, inference code, and demo are available at https://molmo.allenai.org.
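The abstract notes that select model weights and inference code are already available at https://molmo.allenai.org. Below is a minimal sketch of what image captioning with an openly released Molmo checkpoint could look like using Hugging Face transformers. The repository ID (allenai/Molmo-7B-D-0924) and the model-specific processor.process / model.generate_from_batch helpers are assumptions based on the public release, not details stated on this page.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

REPO_ID = "allenai/Molmo-7B-D-0924"  # assumed checkpoint name, not given in the abstract

# Molmo ships its modeling and processing code with the checkpoint,
# so trust_remote_code=True is needed to load both components.
processor = AutoProcessor.from_pretrained(
    REPO_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Any RGB image works; this one is fetched from the web purely for illustration.
image = Image.open(
    requests.get("https://picsum.photos/id/237/536/354", stream=True).raw
)

# Build model inputs from an image plus a text prompt, then add a batch dimension.
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate a caption and decode only the newly produced tokens.
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))

The same interface can be prompted for the paper's 2D pointing behavior by asking the model to point at objects instead of describing the image, though the exact output format is defined by the released checkpoints rather than by this abstract.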

