Critic-R: Improving Agentic Search with Instruction-Tuned Retrievers Using Natural-Language Introspective Feedback

Abstract

Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-standard annotations that limit real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model during both inference and training. Critic-R introduces a critic model that evaluates the agent's introspective reasoning trace after consuming retrieved evidence to determine whether the retrieved context sufficiently supports the next reasoning step. Critic-R has two complementary mechanisms: Critic-R-Zero, an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions, and Critic-Embed, an optimization approach for retrieval models that leverages successful and failed refinement trajectories as automatic supervision without requiring manual relevance annotation. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.
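The two mechanisms described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the function names (`refine_until_supported`, `supervision_from`), the callable signatures, the default instruction string, the round limit, and the rule that passages from the final successful attempt count as positives while passages from earlier failed attempts count as negatives. The retriever, agent, critic, and rewriter are toy stand-ins so that the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    query: str
    instruction: str
    docs: List[str]
    sufficient: bool  # critic's verdict on the agent's reasoning trace


def refine_until_supported(
    question: str,
    retrieve: Callable[[str, str], List[str]],            # (query, instruction) -> docs
    reason: Callable[[str, List[str]], str],              # (question, docs) -> reasoning trace
    critic: Callable[[str], bool],                        # trace -> context sufficient?
    rewrite: Callable[[str, str, str], Tuple[str, str]],  # (question, query, trace) -> (query, instruction)
    max_rounds: int = 3,                                  # hypothetical budget
) -> List[Attempt]:
    """Inference-time loop: retrieve, reason, critique, rewrite on failure."""
    query = question
    instruction = "Retrieve passages that help answer the question."  # assumed default
    trajectory: List[Attempt] = []
    for _ in range(max_rounds):
        docs = retrieve(query, instruction)
        trace = reason(question, docs)
        ok = critic(trace)
        trajectory.append(Attempt(query, instruction, docs, ok))
        if ok:
            break
        query, instruction = rewrite(question, query, trace)
    return trajectory


def supervision_from(trajectory: List[Attempt]) -> List[Tuple[str, List[str], List[str]]]:
    """Assumed labelling rule: for each failed query, emit
    (query, positive docs from the successful attempt, its own docs as negatives)."""
    winners = [a for a in trajectory if a.sufficient]
    if not winners:
        return []  # no successful attempt, so no positives to learn from
    positives = winners[-1].docs
    triples = []
    for attempt in trajectory:
        if attempt.sufficient:
            continue
        negatives = [d for d in attempt.docs if d not in positives]
        triples.append((attempt.query, positives, negatives))
    return triples


# Toy stand-ins (hypothetical) so the sketch runs end to end.
CORPUS = {
    "film": "Inception was directed by Christopher Nolan.",
    "person": "Christopher Nolan was born in London.",
}


def toy_retrieve(query: str, instruction: str) -> List[str]:
    # Returns the biography only when the query names the director.
    return [CORPUS["person"] if "Nolan" in query else CORPUS["film"]]


def toy_reason(question: str, docs: List[str]) -> str:
    if any("born" in d for d in docs):
        return "SUPPORTED: the context states the birthplace."
    return "MISSING: birthplace of Christopher Nolan."


def toy_critic(trace: str) -> bool:
    return trace.startswith("SUPPORTED")


def toy_rewrite(question: str, query: str, trace: str) -> Tuple[str, str]:
    return "Where was Christopher Nolan born?", "Retrieve biographical passages."
```

Calling `refine_until_supported("Where was the director of Inception born?", toy_retrieve, toy_reason, toy_critic, toy_rewrite)` produces a two-step trajectory: the first attempt fails the critic, and the rewritten query succeeds. Passing that trajectory to `supervision_from` yields one training triple without any manual relevance label. In a real system each callable would wrap a model, and the triples would feed a contrastive objective of the practitioner's choice.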