Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning

Abstract

Tool-Integrated Reasoning (TIR) lets large language models (LLMs) strengthen their internal reasoning by calling external tools. However, models that use TIR often behave suboptimally: they call tools too rarely or too often, and they overthink after a tool call returns. How to incentivize LLMs to perform TIR efficiently and accurately while keeping the reasoning process stable remains an open question. In this paper, we first examine how tool calls affect model reasoning from the perspective of information entropy. We find that tool call results produce a distinct change in the information entropy of the subsequent reasoning, and that the overall entropy of the reasoning chain varies with the number of tool calls. Building on these findings, we propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately. The framework consists of dataset construction and multi-stage fine-tuning. For dataset construction, we use continuous self-evolved sampling with the fine-tuned model, combining vanilla sampling with entropy-guided sampling. In addition, we apply strict criteria when selecting positive-negative pairs during sampling. Training follows a two-stage approach: Supervised Fine-Tuning (SFT) followed by Self-Evolved Direct Preference Optimization (DPO). Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light, which significantly improves the model's efficiency in executing TIR tasks.
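The two quantities the abstract relies on, the information entropy of the model's reasoning and the preference objective used in the DPO stage, can be illustrated with a minimal sketch. It is not the Tool-Light implementation: the function names, the use of raw per-token logits, and the value of `beta` are assumptions made for illustration. The entropy is the standard Shannon entropy of the next-token distribution, and the loss is the standard DPO loss for a single positive-negative pair.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(span_logits):
    """Average token entropy over a span of generated tokens.

    Hypothetical helper: comparing this value for the tokens before and
    after a tool result is one way to observe the entropy change the
    abstract describes.
    """
    return sum(token_entropy(l) for l in span_logits) / len(span_logits)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(x)) equals softplus(-x); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Toy logits: a flat (uncertain) distribution versus a peaked (confident) one.
    before_tool = [[0.0, 0.0, 0.0, 0.0]]
    after_tool = [[8.0, 0.0, 0.0, 0.0]]
    print("entropy before tool result:", mean_entropy(before_tool))
    print("entropy after tool result: ", mean_entropy(after_tool))
    print("DPO loss:", dpo_loss(-10.0, -14.0, -11.0, -12.0))
```

In this sketch a lower mean entropy simply means the next-token distribution is more peaked. Which entropy patterns mark a good or bad trajectory, and how pairs are selected for the DPO stage, is defined by the paper itself and is not reproduced here.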

