
Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning

September 27, 2025
Authors: Yifei Chen, Guanting Dong, Zhicheng Dou
cs.AI

Abstract

Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools. However, models employing TIR often display suboptimal behaviors, such as insufficient or excessive tool usage and overthinking after tool calls. The challenge of incentivizing LLMs to perform TIR efficiently and accurately, while stabilizing the reasoning process, remains an open question. In this paper, we start by exploring the impact of tool calls on model reasoning from the perspective of information entropy. Our findings indicate that tool call results lead to a distinct change in the information entropy of subsequent reasoning, with the overall entropy of the reasoning chain varying based on the number of tool calls. Building on these insights, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. Our framework includes dataset construction and multi-stage fine-tuning. For dataset construction, we employ continuous self-evolved sampling using the fine-tuned model, integrating both vanilla sampling and entropy-guided sampling. In addition, we establish strict criteria for selecting positive-negative pairs during sampling. The training process involves a two-stage approach, comprising Supervised Fine-Tuning (SFT) and Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, significantly improving the model's efficiency in executing TIR tasks.
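
The entropy signal described in the abstract can be illustrated with a minimal sketch. The snippet below is not the paper's released code; the model name, the tool-call markup, and the branching threshold are illustrative assumptions. It computes the Shannon entropy of the next-token distribution at each position of a reasoning chain, which an entropy-guided sampler could use to decide where to branch additional rollouts.

```python
# Minimal sketch (not the paper's implementation) of measuring token-level
# entropy along a tool-integrated reasoning chain. Model name, tool-call
# markup, and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def token_entropies(text: str) -> torch.Tensor:
    """Shannon entropy H = -sum(p * log p) of the next-token distribution at each position."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # [1, seq_len, vocab]
    probs = torch.softmax(logits.float(), dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1).squeeze(0)

# Entropy-guided sampling idea: branch extra rollouts at high-entropy positions,
# e.g. right after a tool-call result, where many continuations are plausible.
chain = "Question: ... <tool_call>search(...)</tool_call> <result>...</result> Therefore ..."
H = token_entropies(chain)
branch_points = (H > H.mean() + H.std()).nonzero().flatten()  # illustrative threshold
```

Under this reading, positions with unusually high entropy (for example, immediately after a tool result) are natural branch points for collecting the positive-negative pairs later used in the SFT and self-evolved DPO stages.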