Pixtral 12B
October 9, 2024
Authors: Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova, Marie Pellat, Patrick Von Platen, Nikhil Raghuraman, Baptiste Rozière, Alexandre Sablayrolles, Lucile Saulnier, Romain Sauvestre, Wendy Shang, Roman Soletskyi, Lawrence Stewart, Pierre Stock, Joachim Studnia, Sandeep Subramanian, Sagar Vaze, Thomas Wang
cs.AI
Abstract
We introduce Pixtral-12B, a 12-billion-parameter multimodal language model.
Pixtral-12B is trained to understand both natural images and documents,
achieving leading performance on various multimodal benchmarks, surpassing a
number of larger models. Unlike many open-source models, Pixtral is also a
cutting-edge text model for its size, and does not compromise on natural
language performance to excel in multimodal tasks. Pixtral uses a new vision
encoder trained from scratch, which allows it to ingest images at their natural
resolution and aspect ratio. This gives users flexibility on the number of
tokens used to process an image. Pixtral is also able to process any number of
images in its long context window of 128K tokens. Pixtral 12B substantially
outperforms other open models of similar sizes (Llama-3.2 11B and Qwen-2-VL 7B).
It also outperforms much larger open models like Llama-3.2 90B while being 7x
smaller. We further contribute an open-source benchmark, MM-MT-Bench, for
evaluating vision-language models in practical scenarios, and provide detailed
analysis and code for standardized evaluation protocols for multimodal LLMs.
Pixtral-12B is released under the Apache 2.0 license.
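For intuition on the resolution-to-token trade-off the abstract describes, the sketch below estimates how many image tokens a native-resolution encoder might produce as the input is downscaled. The 16x16 patch size and the one-break-token-per-row convention are assumptions made for illustration; they are not stated in this abstract.

    # Minimal sketch, assuming a patch-based encoder with 16x16 patches
    # and one end-of-row break token per patch row (illustrative only).
    import math

    def estimate_image_tokens(width: int, height: int, patch_size: int = 16) -> int:
        """Rough token count: one token per patch, plus one break token per row."""
        cols = math.ceil(width / patch_size)
        rows = math.ceil(height / patch_size)
        return rows * cols + rows

    # Downscaling an image trades visual detail for fewer tokens.
    print(estimate_image_tokens(1024, 1024))  # 4160 tokens at full resolution
    print(estimate_image_tokens(512, 512))    # 1056 tokens after 2x downscaling

Under these assumptions, halving each image dimension cuts the token budget roughly fourfold, which is the flexibility the abstract attributes to users choosing how many tokens to spend per image.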