Magistral

Abstract

We introduce Magistral, Mistral's first reasoning model, and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground-up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of large language models (LLMs), present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following, and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0), which additionally includes cold-start data from Magistral Medium.