BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
January 16, 2026
Authors: Shiyu Liu, Yongjing Yin, Jianhao Yan, Yunbo Tang, Qinggang Zhang, Bei Li, Xin Chen, Jingang Wang, Xunliang Cai, Jinsong Su
cs.AI
Abstract
RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While agent policies optimized with large-scale reinforcement learning significantly enhance accuracy, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit "I DON'T KNOW" (IDK), even when evidence is insufficient or reasoning reaches its limit. This lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
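The two components can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a GRPO-style setup where several rollouts are sampled per question, uses group-wide correctness as a proxy for "reasoning has reached its limit" (if no rollout in the group answers correctly, the question is treated as beyond the policy's boundary), and simplifies the adaptive modulator to a fixed warmup-step gate. The function name, signature, and thresholds are all illustrative.

```python
def group_boundary_rewards(rollouts, gold, step, warmup_steps=100):
    """Hypothetical sketch of a group-based boundary-aware reward.

    rollouts: list of (answer_text, is_idk) pairs sampled for ONE question
              (a GRPO-style group); is_idk marks an "I don't know" response.
    gold:     reference answer for exact-match scoring (a simplification).
    step:     current training step, used by the simplified modulator.
    """
    # Boundary proxy: the question exceeds the policy's reasoning limit
    # only if NO non-IDK rollout in the group is correct.
    any_correct = any(
        not is_idk and text == gold for text, is_idk in rollouts
    )

    # Adaptive reward modulator (simplified here to a warmup gate):
    # suspend the IDK reward during early exploration so the policy
    # cannot exploit IDK as a shortcut before it has learned to search.
    idk_reward_active = step >= warmup_steps

    rewards = []
    for text, is_idk in rollouts:
        if is_idk:
            # Reward IDK only when the modulator is active AND the group
            # evidence says the question is beyond the boundary.
            rewards.append(1.0 if (idk_reward_active and not any_correct) else 0.0)
        else:
            rewards.append(1.0 if text == gold else 0.0)
    return rewards
```

Under this sketch, an IDK rollout earns reward only on questions the whole group fails, and earns nothing during warmup, which mirrors the abstract's two incentives without claiming the authors' exact formulation.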