START: Self-Taught Reasoner with Tools

Abstract

Large reasoning models (LRMs) such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks by using long chain-of-thought (CoT) reasoning. However, these models often suffer from hallucinations and inefficiencies because they rely solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START can perform complex computations, check its own work, explore diverse methods, and debug itself, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to use external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation that an LRM generates via Hint-infer, and then fine-tuning the LRM on them. Through this framework, we fine-tuned the QwQ-32B model to obtain START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
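
The two techniques named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say where hints are placed, how code is delimited, or how trajectories are scored, so the function names (`hint_infer`, `hint_rft_filter`, `run_python`, `toy_generate`), the fence-based code detection, the "Output:" and "Final answer:" markers, and the exact-match scoring rule are all assumptions made for illustration. The model is replaced by a hand-written stub, and the trajectory-modification and fine-tuning steps of Hint-RFT are omitted.

```python
import contextlib
import io
import re
from typing import Callable, List

# Hint text quoted in the abstract; everything else below is hypothetical.
HINT = "Wait, maybe using Python here is a good idea."
FENCE = "`" * 3  # assumed delimiter for code the model writes
CODE_BLOCK = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def run_python(code: str) -> str:
    """Execute code and capture stdout. Not sandboxed: illustration only."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def hint_infer(generate: Callable[[str], str], question: str) -> str:
    """Reason, append one hint, run any code the model writes, then conclude."""
    trajectory = question + "\n" + generate(question)
    trajectory += "\n" + HINT + "\n"          # assumed insertion point
    continuation = generate(trajectory)
    trajectory += continuation
    for code in CODE_BLOCK.findall(continuation):
        trajectory += "\nOutput: " + run_python(code)
    return trajectory + generate(trajectory)  # model reads the tool output


def hint_rft_filter(trajectories: List[str], reference: str) -> List[str]:
    """Keep trajectories that invoked the tool and end with the reference answer."""
    kept = []
    for trajectory in trajectories:
        used_tool = "Output:" in trajectory
        correct = trajectory.strip().endswith("Final answer: " + reference)
        if used_tool and correct:
            kept.append(trajectory)
    return kept


def toy_generate(prompt: str) -> str:
    """Stand-in for an LRM: guesses first, writes code after the hint."""
    if "Output:" in prompt:
        return "\nFinal answer: " + prompt.rsplit("Output: ", 1)[1]
    if prompt.rstrip().endswith(HINT):
        return FENCE + "python\nprint(sum(range(1, 101)))\n" + FENCE
    return "I think the sum of 1..100 is about 5000."
```

Calling `hint_infer(toy_generate, "What is 1 + 2 + ... + 100?")` yields a trajectory in which the stub's initial guess is followed by the hint, a code block, the executed result, and a final answer. Passing a list of such trajectories to `hint_rft_filter` keeps only those that both used the tool and match the reference answer, which is the kind of data a rejection-sampling fine-tuning step would then train on.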
