START: Self-taught Reasoner with Tools

Abstract

Large reasoning models (LRMs) such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the use of long chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies because they rely solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START can perform complex computations, check its own work, explore diverse methods, and debug itself, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to use external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation that an LRM generates via Hint-infer, and then fine-tuning the LRM on them. Through this framework, we fine-tuned the QwQ-32B model to obtain START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
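To make the two techniques concrete, here is a minimal sketch of the idea rather than the authors' implementation. The hint string is the one quoted in the abstract; everything else is an assumption made for illustration: the `<python>` and `<output>` tags, the `toy_model` stand-in for an LRM, the unsandboxed `run_python` helper, and the simple keep/reject rule in `filter_for_rft`.

```python
import contextlib
import io
import re
from typing import Callable

HINT = "Wait, maybe using Python here is a good idea."  # hint quoted in the abstract
# Hypothetical tag format for tool calls; the abstract does not specify one.
CODE_RE = re.compile(r"<python>(.*?)</python>", re.DOTALL)


def run_python(code: str) -> str:
    """Run code and return its stdout, or the error text (no sandbox: demo only)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip()


def hint_infer(prompt: str, generate: Callable[[str], str], max_hints: int = 1) -> str:
    """Generate, then append the hint, continue, and feed tool output back in."""
    text = prompt + generate(prompt)
    for _ in range(max_hints):
        text += "\n" + HINT + "\n"           # insert the hint into the reasoning
        continuation = generate(text)        # the model may now write code
        text += continuation
        match = CODE_RE.search(continuation)
        if match:
            text += "\n<output>" + run_python(match.group(1)) + "</output>\n"
            text += generate(text)           # the model reads the tool output
    return text


def filter_for_rft(trajectories: list[str], reference: str) -> list[str]:
    """Assumed rejection rule: keep trajectories that call the tool and end correctly."""
    suffix = f"answer is {reference}."
    return [t for t in trajectories if CODE_RE.search(t) and t.rstrip().endswith(suffix)]


def toy_model(context: str) -> str:
    """Scripted stand-in for an LRM, keyed on how the context ends."""
    tail = context.rstrip()
    if tail.endswith("</output>"):
        result = re.findall(r"<output>(.*?)</output>", context, re.DOTALL)[-1]
        return f"The tool returned {result}, so the answer is {result}."
    if tail.endswith(HINT):
        return "<python>print(sum(i * i for i in range(1, 101)))</python>"
    return "I think the sum is about 300000."


trace = hint_infer("Q: What is 1^2 + 2^2 + ... + 100^2?\n", toy_model)
kept = filter_for_rft([trace, "No tool used. The answer is 338350."], "338350")
print(trace)
print(len(kept))
```

In this toy run the first guess is wrong, the hint prompts a tool call, and the executed code supplies the corrected value. The filter then keeps only the trajectory that both invoked the tool and reached the reference answer, which is the kind of data a rejection-sampling fine-tuning step would train on. A real system would replace `toy_model` with an actual model and run the code in an isolated sandbox.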
