Critic-R: Enhancing Agentic Search via Instruction-Tuned Retrievers with Natural-Language Introspective Feedback

Abstract

Agentic search systems interact iteratively with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains difficult: existing approaches often require heavy co-training or gold-standard annotations, which limits their real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model at both inference and training time. Critic-R introduces a critic model that evaluates the agent's introspective reasoning trace after it has consumed the retrieved evidence, and judges whether the retrieved context sufficiently supports the next reasoning step. Critic-R comprises two complementary mechanisms. Critic-R-Zero is an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions. Critic-Embed is an optimization approach for retrieval models that uses successful and failed refinement trajectories as automatic supervision, with no manual relevance annotation. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. The results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.
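To make the two mechanisms concrete, the following is a minimal sketch and not the authors' implementation. It assumes a critic that returns a sufficiency verdict plus natural-language feedback, and it assumes that documents from a failed round can serve as negatives against documents from the final successful round. All names (`refine_loop`, `mine_triples`, `Step`, the toy retriever, critic, and rewriter, and the three-sentence corpus) are hypothetical. In the framework described above the critic judges the agent's reasoning trace; here a keyword check stands in for that step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    query: str
    instruction: str
    docs: List[str]
    sufficient: bool  # critic verdict for this round


def refine_loop(
    question: str,
    retrieve: Callable[[str, str], List[str]],
    critic: Callable[[str, List[str]], Tuple[bool, str]],
    rewrite: Callable[[str, str, str, str], Tuple[str, str]],
    max_rounds: int = 3,
) -> List[Step]:
    """Inference-time refinement: retrieve, consult the critic, rewrite, repeat."""
    query = question
    instruction = "Retrieve passages that help answer the question."
    trajectory: List[Step] = []
    for _ in range(max_rounds):
        docs = retrieve(query, instruction)
        ok, feedback = critic(question, docs)
        trajectory.append(Step(query, instruction, docs, ok))
        if ok:
            break
        # Assumption: the rewriter sees the critic's feedback in plain text.
        query, instruction = rewrite(question, query, instruction, feedback)
    return trajectory


def mine_triples(trajectory: List[Step]) -> List[Tuple[str, str, str]]:
    """Build (query, positive, negative) triples from one trajectory.

    Assumption: docs from the last successful round are positives, and docs
    from failed rounds are negatives for the query that retrieved them.
    """
    successes = [s for s in trajectory if s.sufficient]
    if not successes:
        return []
    positives = successes[-1].docs
    triples = []
    for step in trajectory:
        if step.sufficient:
            continue
        for neg in step.docs:
            if neg in positives:
                continue
            for pos in positives:
                triples.append((step.query, pos, neg))
    return triples


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---
CORPUS = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Paris is the capital of France.",
]


def toy_retrieve(query: str, instruction: str, k: int = 1) -> List[str]:
    # Rank by word overlap with the query; the instruction is ignored here.
    q = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda d: -len(q & set(d.lower().rstrip(".").split())))
    return ranked[:k]


def toy_critic(question: str, docs: List[str]) -> Tuple[bool, str]:
    # Stand-in for the critic: the context suffices once it names the country.
    if any("poland" in d.lower() for d in docs):
        return True, ""
    return False, "The context names the birthplace but not its country."


def toy_rewrite(question: str, query: str, instruction: str, feedback: str) -> Tuple[str, str]:
    # Stand-in for an LLM rewriter that would condition on the feedback.
    return "Warsaw is the capital of which country", "Retrieve passages about the country."


question = "capital city where Marie Curie was born"
trajectory = refine_loop(question, toy_retrieve, toy_critic, toy_rewrite)
triples = mine_triples(trajectory)
print(len(trajectory), len(triples))
```

In this toy run the first round retrieves only the birthplace sentence, the critic rejects it, and the rewritten query retrieves the sentence naming the country. The resulting triple pairs the original query with the useful passage as positive and the insufficient passage as negative, which is the kind of annotation-free signal a contrastive retriever objective could consume.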