THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
September 17, 2025
Authors: Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Jun Du, Quan Liu, Jianqing Gao
cs.AI
Abstract
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, which aligns with the policy and generalizes well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively on both reasoning and non-reasoning models. It further achieves state-of-the-art performance among models of similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
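
To make the hierarchical optimization idea concrete, the sketch below shows one way a trajectory-level answer reward could be combined with step-level credit for successful tool (code) executions, reflecting the abstract's observation that successful intermediate tool calls strongly predict a correct final answer. This is a minimal illustration under our own assumptions, not the authors' implementation; the `Step`, `Trajectory`, and `step_weight` names are hypothetical.

```python
# Hypothetical sketch of a hierarchical reward in the spirit of THOR:
# a sparse trajectory-level reward for the final answer, plus dense
# step-level rewards for tool calls that execute successfully.
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    code: str          # code emitted at this reasoning step
    executed_ok: bool  # whether the interpreter ran it without error


@dataclass
class Trajectory:
    steps: List[Step]
    final_answer: str


def trajectory_reward(traj: Trajectory, gold_answer: str) -> float:
    """Trajectory-level reward: 1 if the final answer matches the reference."""
    return 1.0 if traj.final_answer.strip() == gold_answer.strip() else 0.0


def step_rewards(traj: Trajectory) -> List[float]:
    """Step-level rewards: credit each tool call that executed successfully."""
    return [1.0 if s.executed_ok else 0.0 for s in traj.steps]


def hierarchical_reward(traj: Trajectory, gold_answer: str,
                        step_weight: float = 0.1) -> float:
    """Combine both levels; step_weight is an assumed hyperparameter."""
    step_term = sum(step_rewards(traj)) / max(len(traj.steps), 1)
    return trajectory_reward(traj, gold_answer) + step_weight * step_term
```

A reward of this shape gives the policy dense feedback on step-level code generation even when the final answer is wrong, which is the intuition behind jointly optimizing the two levels; the actual reward design and weighting used by THOR are specified in the paper, not here.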