MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
January 15, 2026
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
cs.AI
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
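The abstract frames credit assignment as a bipartite matching between predicted and ground-truth tool-call traces, yielding dense turn-level rewards that are then blended with a trajectory-level signal. The snippet below is a minimal, hypothetical sketch of that idea, not the authors' implementation: the `similarity` function, the reward derivation, and the mixing coefficient `alpha` are illustrative assumptions, and the matching is solved with SciPy's `linear_sum_assignment`.

```python
# Hypothetical sketch of bipartite-matching-based turn-level credit assignment.
# Not the MatchTIR implementation; similarity(), the reward scheme, and `alpha`
# are illustrative assumptions made for this example.
import numpy as np
from scipy.optimize import linear_sum_assignment

def similarity(pred_call: str, gold_call: str) -> float:
    """Toy similarity between a predicted and a ground-truth tool call
    (exact match on the serialized call); a real system would compare
    tool names and arguments at a finer granularity."""
    return 1.0 if pred_call == gold_call else 0.0

def turn_level_rewards(pred_calls, gold_calls):
    """Match predicted tool calls to ground-truth calls with maximum-weight
    bipartite matching and return one reward per predicted turn."""
    sim = np.array([[similarity(p, g) for g in gold_calls] for p in pred_calls])
    rows, cols = linear_sum_assignment(sim, maximize=True)
    rewards = np.zeros(len(pred_calls))
    rewards[rows] = sim[rows, cols]  # unmatched (redundant) turns keep reward 0
    return rewards

def dual_level_advantages(turn_rewards, trajectory_reward, alpha=0.5):
    """Blend the local turn-level signal with the global trajectory-level
    outcome so each interaction turn receives a distinct advantage value."""
    turn_adv = turn_rewards - turn_rewards.mean()              # local component
    traj_adv = np.full_like(turn_rewards, trajectory_reward)   # shared global component
    return alpha * turn_adv + (1 - alpha) * traj_adv

# Example: two of three predicted calls align with the ground-truth trace.
pred = ["search(q='flight to NYC')", "calc(2+2)", "book(flight_id=7)"]
gold = ["search(q='flight to NYC')", "book(flight_id=7)"]
print(dual_level_advantages(turn_level_rewards(pred, gold), trajectory_reward=1.0))
```

Under these assumptions, the redundant `calc(2+2)` turn is left unmatched and receives a lower advantage than the two matched calls, which illustrates how turn-level matching can separate effective tool calls from superfluous ones within a successful trajectory.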