Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement
May 26, 2026
Authors: Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang
cs.AI
Abstract
Agentic reinforcement learning (RL) has proven effective for training LLM-based agents with external tool-use capabilities. However, we find that agentic RL training increases redundant tool calls and blurs the model's intrinsic knowledge boundary: the model fails to distinguish when tools are needed from when parametric knowledge suffices. Existing solutions based on reward shaping provide coarse-grained optimization targets that tend to incentivize indiscriminate tool-call suppression, leading to reward hacking. In this paper, we propose AKBE (Agentic Knowledge Boundary Enhancement), an on-policy method that dynamically probes the model's intrinsic knowledge boundary through dual-path (with-tool and no-tool) rollouts during training. We define the knowledge boundary as the per-instance determination of whether tools are required and, if so, the minimum number of tool calls necessary. By comparing correctness across the two paths, AKBE categorizes trajectories and constructs targeted supervisory signals that guide the model toward an efficient tool-use pattern for each question. These signals integrate seamlessly into the agentic RL training loop. Experiments on seven QA benchmarks show that, relative to standard agentic RL, AKBE improves task accuracy by +1.85 on average and reduces tool calls by 18%, yielding 25% higher tool productivity without any accuracy-efficiency trade-off. Further analysis indicates plug-and-play compatibility across different RL algorithms and clarifies the mechanism of each signal category. Our code is available at https://github.com/CuSO4-Chen/AKBE.
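The dual-path probing idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list AKBE's actual trajectory categories or signal construction, so the names `Boundary`, `Rollout`, `Probe`, `probe_boundary`, and `redundant_rollouts` are hypothetical, as is the three-way split used here. The sketch only assumes that each rollout records whether its answer was correct and how many tool calls it made, and it estimates the per-question boundary as the fewest tool calls among correct rollouts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Boundary(Enum):
    # Hypothetical labels; the paper's own categories may differ.
    PARAMETRIC_SUFFICES = "parametric_suffices"  # solved with zero tool calls
    TOOL_REQUIRED = "tool_required"              # solved only when tools were called
    UNSOLVED = "unsolved"                        # no rollout was correct


@dataclass
class Rollout:
    correct: bool    # whether the final answer matched the reference
    tool_calls: int  # number of tool calls made (0 on the no-tool path)


@dataclass
class Probe:
    boundary: Boundary
    min_tool_calls: Optional[int]  # None when no rollout was correct


def probe_boundary(with_tool: List[Rollout], no_tool: List[Rollout]) -> Probe:
    """Estimate the knowledge boundary for one question from both rollout paths."""
    correct_calls = [r.tool_calls for r in with_tool + no_tool if r.correct]
    if not correct_calls:
        return Probe(Boundary.UNSOLVED, None)
    fewest = min(correct_calls)
    if fewest == 0:
        return Probe(Boundary.PARAMETRIC_SUFFICES, 0)
    return Probe(Boundary.TOOL_REQUIRED, fewest)


def redundant_rollouts(with_tool: List[Rollout], probe: Probe) -> List[Rollout]:
    """Correct with-tool rollouts that used more calls than the estimated minimum."""
    if probe.min_tool_calls is None:
        return []
    return [r for r in with_tool
            if r.correct and r.tool_calls > probe.min_tool_calls]
```

In a training loop, a function like `probe_boundary` would be called once per question on freshly sampled rollouts, which is what makes the probe on-policy: the estimate tracks the current model rather than a fixed annotation. How the resulting categories are turned into supervisory signals is left out here because the abstract does not specify it.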