

Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration

February 3, 2026
Authors: Bowei He, Minda Hu, Zenan Xu, Hongru Wang, Licheng Zong, Yankai Chen, Chen Ma, Xue Liu, Pluto Zhou, Irwin King
cs.AI

Abstract

Search-integrated reasoning enables language agents to transcend static parametric knowledge by actively querying external sources. However, training these agents via reinforcement learning is hindered by the multi-scale credit assignment problem: existing methods typically rely on sparse, trajectory-level rewards that fail to distinguish between high-quality reasoning and fortuitous guesses, leading to redundant or misleading search behaviors. To address this, we propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention, with both components jointly optimized during training. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Meta-Refiner, which selectively diagnoses and repairs flawed steps via a 'cut-and-regenerate' mechanism. To provide fine-grained supervision, we introduce a hybrid reward design that couples outcome correctness with a dense process reward quantifying the information density of retrieved evidence. Theoretically, we formalize the Actor-Refiner interaction as a smoothed mixture policy, proving that selective correction yields strict performance gains over strong baselines. Extensive experiments across various general and multi-hop QA datasets demonstrate that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales, achieving superior reasoning accuracy with minimal overhead.
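The abstract formalizes the Actor-Refiner interaction as a "smoothed mixture policy" paired with a hybrid reward. A minimal illustrative reconstruction of what such a formulation could look like is given below; the gating weight \(\alpha\), the reward symbols, and the balancing coefficient \(\lambda\) are all assumed notation, not the paper's actual definitions.

\[
\pi_{\mathrm{mix}}(a \mid s) \;=\; \bigl(1 - \alpha(s)\bigr)\,\pi_{\mathrm{actor}}(a \mid s) \;+\; \alpha(s)\,\pi_{\mathrm{refiner}}(a \mid s), \qquad \alpha(s) \in [0, 1],
\]

\[
R(\tau) \;=\; R_{\mathrm{outcome}}(\tau) \;+\; \lambda \sum_{t \in \mathcal{S}(\tau)} r_{\mathrm{proc}}(d_t),
\]

where \(\mathcal{S}(\tau)\) indexes the search steps in trajectory \(\tau\) and \(r_{\mathrm{proc}}(d_t)\) scores the information density of the evidence \(d_t\) retrieved at step \(t\). Under this reading, the claimed strict improvement over the Actor alone would hinge on the Refiner intervening (\(\alpha > 0\)) only at states where it diagnoses a genuine flaw.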
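Read as an algorithm, the "cut-and-regenerate" mechanism and the hybrid reward could be sketched as follows. This is a minimal sketch inferred from the abstract alone: the actor/refiner interfaces (generate_trajectory, diagnose, regenerate_from), the Step fields, and the coefficient lam are hypothetical placeholders, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    text: str
    is_search: bool = False
    evidence_density: float = 0.0  # stand-in for the paper's process metric

def search_r2_rollout(actor, refiner, question: str, max_repairs: int = 2) -> List[Step]:
    # Actor produces the initial search-integrated reasoning trajectory:
    # a list of steps interleaving thoughts, search queries, and evidence.
    trajectory = actor.generate_trajectory(question)

    for _ in range(max_repairs):
        # Meta-Refiner selectively diagnoses the trajectory: index of the
        # first flawed step, or None if no defect is found.
        flaw_idx: Optional[int] = refiner.diagnose(trajectory)
        if flaw_idx is None:
            break
        # "Cut": keep only the validated prefix before the flawed step.
        prefix = trajectory[:flaw_idx]
        # "Regenerate": rewrite from the cut point, conditioned on the prefix.
        trajectory = prefix + refiner.regenerate_from(question, prefix)

    return trajectory

def hybrid_reward(trajectory: List[Step], answer: str, gold: str, lam: float = 0.1) -> float:
    # Outcome reward: trajectory-level correctness (exact match here).
    outcome = 1.0 if answer == gold else 0.0
    # Dense process reward: sums the information-density scores of
    # retrieved evidence across search steps.
    process = sum(s.evidence_density for s in trajectory if s.is_search)
    return outcome + lam * process
```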