Pixtral 12B

October 9, 2024
Authors: Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick Von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang
cs.AI

Abstract

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks and surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility in the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar size (Llama-3.2 11B and Qwen-2-VL 7B). It also outperforms much larger open models such as Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license.
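To make the flexible token budget concrete, here is a minimal back-of-the-envelope sketch, not code from the paper: it assumes the 16x16-pixel patch size and the one-break-token-per-patch-row layout described in the Pixtral technical report, and the helper name pixtral_image_tokens is hypothetical. Because the encoder ingests images at native resolution, token cost scales with image area, so a user can downscale an image to spend fewer of the 128K context tokens on it.

```python
import math

def pixtral_image_tokens(height: int, width: int, patch: int = 16) -> int:
    """Estimate how many context tokens one image consumes.

    Sketch under stated assumptions: the image is split into
    patch x patch tiles at its native resolution (one token per
    tile), plus one end-of-row break token per row of patches.
    """
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    return rows * cols + rows  # patch tokens + per-row break tokens

# Halving each dimension cuts the token cost roughly 4x:
print(pixtral_image_tokens(768, 1024))  # 48*64 + 48 = 3120 tokens
print(pixtral_image_tokens(384, 512))   # 24*32 + 24 = 792 tokens
```

Under these assumptions, a handful of full-resolution documents or dozens of downscaled thumbnails fit comfortably in the 128K-token window, which is the trade-off the abstract's "flexibility in the number of tokens" refers to.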
