Critic-R: Improving Agentic Search with Instruction-Tuned Retrievers and Natural-Language Introspective Feedback

Abstract

Agentic search systems answer complex queries by interacting iteratively with retrieval models. Despite substantial progress, optimizing retrievers for agentic search remains challenging: it often requires heavy co-training or gold-standard annotations, which limits real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model during both inference and training. Critic-R introduces a critic model that evaluates the agent's introspective reasoning trace after the retrieved evidence has been consumed, in order to determine whether the retrieved context sufficiently supports the next reasoning step. Critic-R comprises two complementary mechanisms. Critic-R-Zero is an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions. Critic-Embed is an optimization approach for retrieval models that uses successful and failed refinement trajectories as automatic supervision, with no manual relevance annotation required. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.
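To make the feedback loop concrete, the following is a minimal Python sketch of an inference-time refinement loop in the spirit of Critic-R-Zero. It is not the authors' implementation. Every name in it (`Attempt`, `refine_until_sufficient`, `to_preference_pairs`, the `toy_*` stubs, the default instruction string, and the `max_rounds` budget) is a hypothetical placeholder, and the retriever, agent, critic, and rewriter are passed in as plain callables. The pairing of successful with failed attempts is only one assumed way such trajectories could serve as automatic supervision; the abstract does not specify how Critic-Embed consumes them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One retrieval round: what was asked, what came back, and the critic's verdict."""
    query: str
    instruction: str
    docs: List[str]
    sufficient: bool


def refine_until_sufficient(
    question: str,
    retrieve: Callable[[str, str], List[str]],
    reason: Callable[[str, List[str]], str],
    critic: Callable[[str], bool],
    rewrite: Callable[[str, str, str, str], Tuple[str, str]],
    max_rounds: int = 3,  # assumed budget, not taken from the paper
) -> List[Attempt]:
    """Retrieve, let the agent reason, ask the critic, and rewrite on failure."""
    query = question
    instruction = "Retrieve passages that help answer the question."  # assumed default
    attempts: List[Attempt] = []
    for _ in range(max_rounds):
        docs = retrieve(query, instruction)
        trace = reason(question, docs)   # agent's introspective reasoning trace
        ok = critic(trace)               # does the context support the next step?
        attempts.append(Attempt(query, instruction, docs, ok))
        if ok:
            break
        query, instruction = rewrite(question, query, instruction, trace)
    return attempts


def to_preference_pairs(attempts: List[Attempt]) -> List[Tuple[Attempt, Attempt]]:
    """Assumed supervision format: (successful attempt, failed attempt) pairs."""
    if not attempts or not attempts[-1].sufficient:
        return []
    winner = attempts[-1]
    return [(winner, loser) for loser in attempts[:-1]]


# Toy stand-ins so the sketch runs end to end; real components would be models.
CORPUS = {
    "capital": "Paris is the capital of France.",
    "river": "The Seine flows through Paris.",
}


def toy_retrieve(query: str, instruction: str) -> List[str]:
    # Keyword lookup standing in for an instruction-aware retriever.
    return [text for key, text in CORPUS.items() if key in query.lower()]


def toy_reason(question: str, docs: List[str]) -> str:
    return f"enough evidence: {docs[0]}" if docs else "I lack evidence about the capital."


def toy_critic(trace: str) -> bool:
    return trace.startswith("enough evidence")


def toy_rewrite(question: str, query: str, instruction: str, trace: str) -> Tuple[str, str]:
    return "capital of France", instruction + " Focus on capital cities."


attempts = refine_until_sufficient(
    "Which city hosts the French government?",
    toy_retrieve, toy_reason, toy_critic, toy_rewrite,
)
pairs = to_preference_pairs(attempts)
```

In this toy run the first query retrieves nothing, the critic rejects the resulting trace, and the rewritten query succeeds on the second round, so the trajectory yields one (successful, failed) pair. Passing the components as callables keeps the loop independent of any particular retriever or language model.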