Pixtral 12B

Abstract

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, and it achieves leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and it does not compromise on natural language performance to excel at multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility in the number of tokens used to process an image. Pixtral can also process any number of images within its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar size (Llama-3.2 11B and Qwen-2-VL 7B). It also outperforms much larger open models such as Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and we provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license.
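
As a rough illustration of how native-resolution input turns into a variable token budget, the sketch below counts the patch tokens an image might occupy and how many such images might fit in a 128K-token window. The patch size of 16 pixels, the reading of 128K as 128,000, and the function names are assumptions made for illustration; none of them is stated in the abstract. The estimate also ignores any separator or special tokens and any text in the prompt.

```python
import math


def image_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Estimate patch tokens for an image kept at its native resolution.

    patch_size=16 is an assumed value, not taken from the abstract.
    Separator and other special tokens are deliberately not counted.
    """
    if width <= 0 or height <= 0 or patch_size <= 0:
        raise ValueError("width, height and patch_size must be positive")
    cols = math.ceil(width / patch_size)   # patches per row
    rows = math.ceil(height / patch_size)  # number of patch rows
    return rows * cols


def images_fitting_context(
    width: int,
    height: int,
    context_tokens: int = 128_000,  # "128K" read as 128,000 (assumption)
    patch_size: int = 16,
) -> int:
    """Upper bound on same-sized images that fit in the context window."""
    return context_tokens // image_token_count(width, height, patch_size)


if __name__ == "__main__":
    for w, h in [(512, 512), (1024, 768), (100, 50)]:
        print(f"{w}x{h}: {image_token_count(w, h)} tokens, "
              f"at most {images_fitting_context(w, h)} images")
```

Under these assumptions the token cost grows with image area, so a smaller or downscaled image is cheaper to process and more images fit in a single context.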