Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
February 4, 2025
Authors: Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable reasoning
capabilities across diverse domains. Recent studies have shown that increasing
test-time computation enhances LLMs' reasoning capabilities. This typically
involves extensive sampling at inference time guided by an external LLM
verifier, resulting in a two-player system. Despite external guidance, the
effectiveness of this system demonstrates the potential of a single LLM to
tackle complex tasks. Thus, we pose a new research problem: Can we internalize
the searching capabilities to fundamentally enhance the reasoning abilities of
a single LLM? This work explores an orthogonal direction focusing on
post-training LLMs for autoregressive searching (i.e., an extended reasoning
process with self-reflection and self-exploration of new strategies). To
achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a
two-stage training paradigm: 1) a small-scale format tuning stage to
internalize the COAT reasoning format and 2) a large-scale self-improvement
stage leveraging reinforcement learning. Our approach results in Satori, a 7B
LLM trained on open-source models and data. Extensive empirical evaluations
demonstrate that Satori achieves state-of-the-art performance on mathematical
reasoning benchmarks while exhibiting strong generalization to out-of-domain
tasks. Code, data, and models will be fully open-sourced.
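The two-stage training paradigm in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released code: the class and function names (StubPolicy, format_tuning, rl_self_improvement, CoatExample) are hypothetical placeholders, the stub stands in for the 7B model and its optimizer, and the meta-action markers are shown only to suggest how a COAT-style trace might be annotated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoatExample:
    """A demonstration pairing a problem with a COAT-formatted reasoning trace."""
    problem: str
    coat_trace: str  # trace annotated with meta-action markers


class StubPolicy:
    """Placeholder standing in for the 7B LLM; real training would use an
    actual model, tokenizer, and optimizer."""

    def supervised_step(self, prompt: str, target: str) -> None:
        pass  # imitate a COAT-formatted demonstration (format tuning)

    def generate(self, prompt: str) -> str:
        # Illustrative output only: an extended trace with self-reflection
        # and exploration of an alternative strategy.
        return "<|continue|> ... <|reflect|> ... <|explore|> ..."

    def policy_gradient_step(self, prompt: str, trace: str, reward: float) -> None:
        pass  # update the policy toward higher-reward traces


def format_tuning(policy: StubPolicy, demos: List[CoatExample]) -> StubPolicy:
    """Stage 1: small-scale format tuning to internalize the COAT reasoning format."""
    for ex in demos:
        policy.supervised_step(ex.problem, ex.coat_trace)
    return policy


def rl_self_improvement(policy: StubPolicy,
                        problems: List[str],
                        reward_fn: Callable[[str, str], float],
                        steps: int = 3) -> StubPolicy:
    """Stage 2: large-scale self-improvement via reinforcement learning."""
    for _ in range(steps):
        for p in problems:
            trace = policy.generate(p)    # autoregressive search with self-reflection
            reward = reward_fn(p, trace)  # e.g. 1.0 if the final answer is correct
            policy.policy_gradient_step(p, trace, reward)
    return policy


if __name__ == "__main__":
    policy = StubPolicy()
    demos = [CoatExample("1+1=?", "<|continue|> 1+1=2 <|reflect|> checks out.")]
    policy = format_tuning(policy, demos)
    policy = rl_self_improvement(policy, ["1+1=?"], reward_fn=lambda p, t: 1.0)
```

The sketch only encodes the division of labor the abstract describes: a small supervised stage teaches the output format, and a larger RL stage lets the model improve by sampling and rewarding its own extended reasoning traces.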