

BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search

January 16, 2026
作者: Shiyu Liu, Yongjing Yin, Jianhao Yan, Yunbo Tang, Qinggang Zhang, Bei Li, Xin Chen, Jingang Wang, Xunliang Cai, Jinsong Su
cs.AI

Abstract

RL-based agentic search enables LLMs to solve complex questions via dynamic planning and external search. While agent policies optimized via large-scale reinforcement learning significantly improve accuracy, we identify a critical gap in reliability: these agents fail to recognize their reasoning boundaries and rarely admit "I DON'T KNOW" (IDK) even when evidence is insufficient or reasoning reaches its limit. This lack of reliability often leads to plausible but unreliable answers, introducing significant risks in many real-world scenarios. To this end, we propose Boundary-Aware Policy Optimization (BAPO), a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut. Extensive experiments on four benchmarks demonstrate that BAPO substantially enhances the overall reliability of agentic search.
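The two components can be illustrated with a minimal sketch. The abstract does not specify the actual reward formulas, so everything below — the function names, the "no non-IDK rollout in the group succeeds" criterion for a reasoning limit, and the step-based warm-up — is an assumed simplification for intuition only, not the paper's implementation:

```python
# Hypothetical sketch of BAPO-style rewards, inferred from the abstract.
# All names, thresholds, and the limit criterion are assumptions.

def boundary_aware_reward(rollouts, idk_enabled):
    """Assign rewards to a group of sampled rollouts for one question.

    rollouts: list of dicts with keys 'answer' (the string 'IDK' or a
              concrete answer) and 'correct' (bool; ignored for IDK).
    idk_enabled: whether the IDK reward is currently active
                 (controlled by the adaptive modulator below).
    """
    non_idk = [r for r in rollouts if r["answer"] != "IDK"]
    # Assumed criterion: the group has "reached its reasoning limit"
    # when no non-IDK attempt in the group answers correctly.
    at_limit = all(not r["correct"] for r in non_idk)
    rewards = []
    for r in rollouts:
        if r["answer"] == "IDK":
            # Reward IDK only when the group evidence suggests the
            # question is beyond the policy's current ability, and
            # only once the modulator has switched the reward on.
            rewards.append(1.0 if (idk_enabled and at_limit) else 0.0)
        else:
            rewards.append(1.0 if r["correct"] else 0.0)
    return rewards


def idk_reward_enabled(step, warmup_steps=200):
    """Adaptive reward modulator (simplified to a step threshold):
    suspend the IDK reward during early exploration so the policy
    cannot exploit IDK as a shortcut before it has learned to search."""
    return step >= warmup_steps
```

Under this scheme an IDK rollout is never rewarded while any sibling rollout succeeds, which is one way to operationalize "encourage IDK only when reasoning reaches its limit" without penalizing genuine attempts.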