Learning to Retrieve from Agent Trajectories

Abstract

Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of search agents powered by large language models (LLMs), however, retrieval is increasingly consumed by agents rather than humans, and is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions are fundamentally mismatched with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories (LRAT) as a new training paradigm, in which supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.
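
The following is a minimal illustrative sketch of the general idea described above, not the authors' implementation. It assumes a hypothetical trajectory format (`Step`), treats browsed documents as positives and retrieved-but-unbrowsed documents as negatives, and uses the length of post-browse reasoning as a stand-in for relevance intensity. The abstract does not specify how intensity is measured or which loss is used, so the weighting rule and the weighted contrastive loss here are assumptions chosen only for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Step:
    """One search step in an agent trajectory (hypothetical format)."""
    query: str
    retrieved: list[str]  # doc ids returned by the retriever
    browsed: list[str]    # doc ids the agent chose to open
    # Assumed proxy for relevance intensity: number of reasoning tokens
    # the agent produced after browsing each document.
    reasoning_tokens: dict[str, int] = field(default_factory=dict)


def mine_examples(trajectory):
    """Turn a trajectory into (query, positive, negatives, weight) tuples.

    Browsed documents are positives; retrieved-but-unbrowsed documents are
    negatives. The log-scaled weight is an illustrative assumption.
    """
    examples = []
    for step in trajectory:
        negatives = [d for d in step.retrieved if d not in step.browsed]
        for doc in step.browsed:
            weight = 1.0 + math.log1p(step.reasoning_tokens.get(doc, 0))
            examples.append((step.query, doc, negatives, weight))
    return examples


def weighted_contrastive_loss(pos_score, neg_scores, weight, temperature=1.0):
    """Weighted softmax cross-entropy over one positive and its negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return weight * (log_z - logits[0])


# Usage: one step where the agent opened d2 and skipped d1 and d3.
trajectory = [Step(query="q", retrieved=["d1", "d2", "d3"], browsed=["d2"])]
examples = mine_examples(trajectory)
loss = weighted_contrastive_loss(pos_score=0.0, neg_scores=[0.0, 0.0],
                                 weight=examples[0][3])
```

In a real training loop the scores would come from the retriever being trained, and the per-example weight would scale that example's contribution to the gradient.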