Learning to Retrieve from Agent Trajectories

Abstract

Information retrieval (IR) systems have traditionally been designed and trained for human users, with learning-to-rank methods relying heavily on large-scale human interaction logs such as clicks and dwell time. With the rapid emergence of search agents powered by large language models (LLMs), however, retrieval results are increasingly consumed by agents rather than humans, and retrieval is embedded as a core component within multi-turn reasoning and action loops. In this setting, retrieval models trained under human-centric assumptions are fundamentally mismatched with the way agents issue queries and consume results. In this work, we argue that retrieval models for agentic search should be trained directly from agent interaction data. We introduce learning to retrieve from agent trajectories as a new training paradigm, in which supervision is derived from multi-step agent interactions. Through a systematic analysis of search agent trajectories, we identify key behavioral signals that reveal document utility, including browsing actions, unbrowsed rejections, and post-browse reasoning traces. Guided by these insights, we propose LRAT, a simple yet effective framework that mines high-quality retrieval supervision from agent trajectories and incorporates relevance intensity through weighted optimization. Extensive experiments on both in-domain and out-of-domain deep research benchmarks demonstrate that retrievers trained with LRAT consistently improve evidence recall, end-to-end task success, and execution efficiency across diverse agent architectures and scales. Our results highlight agent trajectories as a practical and scalable supervision source, pointing to a promising direction for retrieval in the era of agentic search.
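The abstract names the ingredients of LRAT (browsing actions, unbrowsed rejections, post-browse reasoning traces, and weighted optimization) but does not give the procedure itself. The following is a minimal sketch of how such signals could be turned into weighted training examples. It is not the authors' implementation. The `SearchStep` and `TrainingExample` types, the `usefulness` score standing in for a post-browse reasoning signal, the `default_weight` value, and the softmax-style contrastive loss are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SearchStep:
    """One retrieval call inside an agent trajectory (hypothetical schema)."""
    query: str
    results: list[str]                 # document ids returned to the agent
    browsed: set[str]                  # document ids the agent chose to open
    # Assumed score in [0, 1] derived from the agent's post-browse reasoning.
    usefulness: dict[str, float] = field(default_factory=dict)


@dataclass
class TrainingExample:
    query: str
    positive: str
    negatives: list[str]
    weight: float                      # relevance intensity for this pair


def mine_examples(trajectory, default_weight=0.5):
    """Browsed documents become positives; unbrowsed results become negatives."""
    examples = []
    for step in trajectory:
        negatives = [d for d in step.results if d not in step.browsed]
        for doc in step.results:
            if doc not in step.browsed:
                continue
            weight = step.usefulness.get(doc, default_weight)
            # Skip documents judged useless after browsing, and steps
            # that offer no unbrowsed result to contrast against.
            if weight > 0 and negatives:
                examples.append(
                    TrainingExample(step.query, doc, negatives, weight)
                )
    return examples


def weighted_contrastive_loss(pos_score, neg_scores, weight, temperature=1.0):
    """Weighted negative log-softmax of the positive over all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)                    # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weight * (logits[0] - log_z)
```

In this sketch, a step whose results are `["d1", "d2", "d3"]` with only `d1` browsed yields one example with `d1` as the positive and `d2`, `d3` as negatives. The weight scales how strongly that example contributes to the loss, which is one plausible reading of "relevance intensity through weighted optimization".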
