Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
February 4, 2025
Authors: Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable reasoning
capabilities across diverse domains. Recent studies have shown that increasing
test-time computation enhances LLMs' reasoning capabilities. This typically
involves extensive sampling at inference time guided by an external LLM
verifier, resulting in a two-player system. Despite external guidance, the
effectiveness of this system demonstrates the potential of a single LLM to
tackle complex tasks. Thus, we pose a new research problem: Can we internalize
the searching capabilities to fundamentally enhance the reasoning abilities of
a single LLM? This work explores an orthogonal direction focusing on
post-training LLMs for autoregressive searching (i.e., an extended reasoning
process with self-reflection and self-exploration of new strategies). To
achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a
two-stage training paradigm: 1) a small-scale format tuning stage to
internalize the COAT reasoning format and 2) a large-scale self-improvement
stage leveraging reinforcement learning. Our approach results in Satori, a 7B
LLM trained on open-source models and data. Extensive empirical evaluations
demonstrate that Satori achieves state-of-the-art performance on mathematical
reasoning benchmarks while exhibiting strong generalization to out-of-domain
tasks. Code, data, and models will be fully open-sourced.
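The two-stage training paradigm in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' released code: the class and function names (StubPolicy, format_tuning, rl_self_improvement, CoatExample) are hypothetical placeholders, the stub stands in for the 7B model and its optimizer, and the meta-action markers are shown only to suggest how a COAT-style trace might be annotated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoatExample:
    """A demonstration pairing a problem with a COAT-formatted reasoning trace."""
    problem: str
    coat_trace: str  # trace annotated with meta-action markers


class StubPolicy:
    """Placeholder standing in for the 7B LLM; real training would use an
    actual model, tokenizer, and optimizer."""

    def supervised_step(self, prompt: str, target: str) -> None:
        pass  # imitate a COAT-formatted demonstration (format tuning)

    def generate(self, prompt: str) -> str:
        # Illustrative output only: an extended trace with self-reflection
        # and exploration of an alternative strategy.
        return "<|continue|> ... <|reflect|> ... <|explore|> ..."

    def policy_gradient_step(self, prompt: str, trace: str, reward: float) -> None:
        pass  # update the policy toward higher-reward traces


def format_tuning(policy: StubPolicy, demos: List[CoatExample]) -> StubPolicy:
    """Stage 1: small-scale format tuning to internalize the COAT reasoning format."""
    for ex in demos:
        policy.supervised_step(ex.problem, ex.coat_trace)
    return policy


def rl_self_improvement(policy: StubPolicy,
                        problems: List[str],
                        reward_fn: Callable[[str, str], float],
                        steps: int = 3) -> StubPolicy:
    """Stage 2: large-scale self-improvement via reinforcement learning."""
    for _ in range(steps):
        for p in problems:
            trace = policy.generate(p)    # autoregressive search with self-reflection
            reward = reward_fn(p, trace)  # e.g. 1.0 if the final answer is correct
            policy.policy_gradient_step(p, trace, reward)
    return policy


if __name__ == "__main__":
    policy = StubPolicy()
    demos = [CoatExample("1+1=?", "<|continue|> 1+1=2 <|reflect|> checks out.")]
    policy = format_tuning(policy, demos)
    policy = rl_self_improvement(policy, ["1+1=?"], reward_fn=lambda p, t: 1.0)
```

The sketch only encodes the division of labor the abstract describes: a small supervised stage teaches the output format, and a larger RL stage lets the model improve by sampling and rewarding its own extended reasoning traces.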