THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
September 17, 2025
Authors: Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Jun Du, Quan Liu, Jianqing Gao
cs.AI
Abstract
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods face three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, which aligns with the policy and generalizes well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
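The hierarchical optimization described above combines a trajectory-level signal (final answer correctness) with step-level signals (whether each intermediate code execution succeeded). The sketch below is an illustration of this kind of reward shaping, not the authors' implementation; the function name `hierarchical_reward`, the additive combination, and the `step_weight` parameter are assumptions made for clarity.

```python
def hierarchical_reward(answer_correct, tool_calls_ok, step_weight=0.5):
    """Illustrative hierarchical reward: trajectory + step components.

    answer_correct: bool, whether the final answer matched the reference
                    (trajectory-level problem solving).
    tool_calls_ok:  list of bools, one per intermediate code/tool call,
                    True if that execution succeeded (step-level signal;
                    the abstract notes such successes strongly predict
                    final-answer correctness).
    step_weight:    assumed scaling factor for the step-level component.
    """
    # Trajectory-level reward: 1 for a correct final answer, else 0.
    trajectory_reward = 1.0 if answer_correct else 0.0

    # Step-level reward: fraction of tool calls that executed successfully.
    if tool_calls_ok:
        step_reward = sum(tool_calls_ok) / len(tool_calls_ok)
    else:
        step_reward = 0.0

    return trajectory_reward + step_weight * step_reward
```

Under this shaping, a rollout can earn partial credit for successful intermediate code executions even when the final answer is wrong, giving the policy a denser learning signal than answer correctness alone.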