Magistral

Abstract

We introduce Magistral, Mistral's first reasoning model, together with our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following, and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0), which further includes cold-start data from Magistral Medium.