SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Abstract

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning: they generate long, redundant trajectories far beyond what the tasks require, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17% to 58% while maintaining or improving accuracy.
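The two mechanisms named above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the `Rollout` record, the dominance rule over (tool calls, tokens), the min-max normalization within the cohort of correct rollouts, and the weights `w_tool` and `w_token` are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    correct: bool    # did the final answer pass the correctness check?
    tool_calls: int  # number of tool-call rounds used
    tokens: int      # number of generated tokens


def pareto_filter(rollouts: List[Rollout]) -> List[Rollout]:
    """Assumed SFT filter: keep correct rollouts that no other correct
    rollout dominates on (tool_calls, tokens)."""
    correct = [r for r in rollouts if r.correct]

    def dominates(a: Rollout, b: Rollout) -> bool:
        # a is no worse on both costs and strictly better on at least one
        no_worse = a.tool_calls <= b.tool_calls and a.tokens <= b.tokens
        better = a.tool_calls < b.tool_calls or a.tokens < b.tokens
        return no_worse and better

    return [r for r in correct if not any(dominates(o, r) for o in correct)]


def relative_saving(value: float, cohort: List[float]) -> float:
    """Map a cost to [0, 1]: 1 for the cheapest in the cohort, 0 for the
    most expensive. Returns 0 when the cohort has no spread."""
    lo, hi = min(cohort), max(cohort)
    if hi == lo:
        return 0.0
    return (hi - value) / (hi - lo)


def gated_rewards(rollouts: List[Rollout],
                  w_tool: float = 0.25,
                  w_token: float = 0.25) -> List[float]:
    """Assumed RL reward: a strict correctness gate, followed by a bonus
    for efficiency measured relative to the other correct rollouts."""
    correct = [r for r in rollouts if r.correct]
    tool_costs = [float(r.tool_calls) for r in correct]
    token_costs = [float(r.tokens) for r in correct]

    rewards = []
    for r in rollouts:
        if not r.correct:
            # Gate closed: a cheap but wrong rollout earns nothing.
            rewards.append(0.0)
            continue
        bonus = (w_tool * relative_saving(r.tool_calls, tool_costs)
                 + w_token * relative_saving(r.tokens, token_costs))
        rewards.append(1.0 + bonus)
    return rewards


# Hypothetical cohort of four rollouts sampled for one question.
cohort = [
    Rollout(correct=True, tool_calls=4, tokens=1000),
    Rollout(correct=True, tool_calls=10, tokens=3000),
    Rollout(correct=False, tool_calls=1, tokens=100),
    Rollout(correct=True, tool_calls=6, tokens=800),
]
rewards = gated_rewards(cohort)
kept_for_sft = pareto_filter(cohort)
```

In this sketch the incorrect rollout receives zero reward even though it is the cheapest, which is the role the abstract assigns to the correctness gate. Because the bonus is relative to the cohort rather than an absolute length penalty, the most expensive correct rollout still receives the full base reward of 1.0.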