SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
June 5, 2026
Authors: Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu
cs.AI
Abstract
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning: they generate long, redundant trajectories that go far beyond what the tasks require, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17% to 58% while maintaining or improving accuracy.
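The abstract gives no formulas, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes each trajectory is summarized by a correctness flag, a tool-call count, and a token count. The names `Trajectory`, `pareto_filter`, `gated_rewards`, the weights `w_tool` and `w_tok`, and the choice to normalize against the most expensive correct trajectory in the group are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    correct: bool    # did the final answer pass the correctness check?
    tool_calls: int  # number of tool-call rounds used
    tokens: int      # total tokens generated


def pareto_filter(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep successful trajectories that no other successful one dominates
    on (tool_calls, tokens). Assumed reading of 'Pareto-efficient filtration'."""
    ok = [t for t in trajs if t.correct]

    def dominates(a: Trajectory, b: Trajectory) -> bool:
        # a is no worse on both costs and strictly better on at least one
        no_worse = a.tool_calls <= b.tool_calls and a.tokens <= b.tokens
        better = a.tool_calls < b.tool_calls or a.tokens < b.tokens
        return no_worse and better

    return [t for t in ok if not any(dominates(o, t) for o in ok)]


def gated_rewards(group: List[Trajectory],
                  w_tool: float = 0.25,
                  w_tok: float = 0.25) -> List[float]:
    """Correctness-gated reward with group-relative efficiency bonuses.

    Hypothetical scheme: incorrect rollouts get 0; correct rollouts get 1
    plus bonuses measured against the costliest correct rollout in the group.
    """
    ok = [t for t in group if t.correct]
    if not ok:
        return [0.0] * len(group)
    max_tools = max(t.tool_calls for t in ok)
    max_tok = max(t.tokens for t in ok)

    rewards = []
    for t in group:
        if not t.correct:
            # Strict gate: being short never pays off when the answer is wrong.
            rewards.append(0.0)
            continue
        tool_eff = 1 - t.tool_calls / max_tools if max_tools > 0 else 0.0
        tok_eff = 1 - t.tokens / max_tok if max_tok > 0 else 0.0
        rewards.append(1.0 + w_tool * tool_eff + w_tok * tok_eff)
    return rewards
```

Because the bonus in this sketch is relative to the sampled group rather than an absolute length penalty, a hard question whose correct rollouts all need many tool calls is not penalized for it; only rollouts that are costlier than their correct peers lose reward.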