

CoRT: Code-integrated Reasoning within Thinking

June 11, 2025
作者: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
cs.AI

Abstract

Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: a Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, so directly combining the two is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data-scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters with supervised fine-tuning, rejection fine-tuning, and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.
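
To make the idea concrete, below is a minimal sketch of the code-integrated reasoning loop the abstract describes: generation alternates with code execution inside a single thought, and the interpreter's output is fed back into the context before generation resumes. The `mock_generate` function, the `<code>`/`<output>` delimiters, and the unsandboxed `run_python` helper are illustrative assumptions, not the paper's actual prompt format or infrastructure.

```python
import io
import re
import contextlib

# Hypothetical delimiters for model-emitted code; the paper's real format may differ.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_python(code: str) -> str:
    """Execute a snippet and capture its stdout (no sandboxing; illustration only)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def mock_generate(context: str) -> str:
    """Stand-in for the LRM: first emits a hint plus code, then reads the result."""
    if "<output>" not in context:
        return ("Summing k^2 for k = 1..1000 by hand is error-prone, so I compute it.\n"
                "<code>print(sum(k * k for k in range(1, 1001)))</code>")
    return "The interpreter returns 333833500, so the answer is 333833500."

def code_integrated_reasoning(question: str, max_turns: int = 4) -> str:
    """Alternate between model generation and code execution inside one thought."""
    context = question
    for _ in range(max_turns):
        segment = mock_generate(context)
        context += "\n" + segment
        match = CODE_BLOCK.search(segment)
        if match is None:
            break  # no more code to run; the reasoning has concluded
        result = run_python(match.group(1))
        context += f"\n<output>{result}</output>"  # feed CI output back into the thought
    return context

if __name__ == "__main__":
    print(code_integrated_reasoning("Compute 1^2 + 2^2 + ... + 1000^2."))
```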