
Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

February 26, 2026
作者: Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani
cs.AI

Abstract

Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
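The factor-T variance claim can be motivated with a simple back-of-the-envelope argument (an illustration under an independence assumption, not the paper's proof). If a T-step trajectory's return is the sum of per-step rewards with common variance σ², full-trajectory scoring inherits the variance of the whole sum, whereas a truncated group varies only in the single branched step:

```latex
\[
R \;=\; \sum_{t=1}^{T} r_t,
\qquad
\operatorname{Var}(R) \;=\; T\sigma^2
\quad \text{(independent steps, } \operatorname{Var}(r_t) = \sigma^2\text{)},
\]
\[
\operatorname{Var}\big(\hat{A}_{\mathrm{trunc}}\big) \;\propto\; \sigma^2
\;=\; \tfrac{1}{T}\operatorname{Var}(R),
\]
```

i.e. a reduction of up to a factor of T, consistent with the bound stated in the abstract.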
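The core of truncated step-level sampling can be illustrated with a minimal sketch: branch k one-step continuations off a shared trajectory prefix, score each with a dense per-step reward, and normalize within the group. The `policy_step` and `judge_reward` functions below are toy stand-ins (not names from the paper) for the policy's step generator and the LLM-as-judge scorer:

```python
import random
import statistics

def policy_step(prefix, seed):
    """Toy stand-in: sample one reasoning/search step given a shared prefix."""
    rng = random.Random(seed + len(prefix))
    return f"step-{rng.randint(0, 9)}"

def judge_reward(prefix, step):
    """Toy stand-in for a dense LLM-as-judge score in [0, 1]."""
    return (hash((tuple(prefix), step)) % 100) / 100.0

def truncated_step_advantages(prefix, k):
    """Branch k one-step continuations off a shared prefix and return
    group-normalized advantages, one per sampled step."""
    steps = [policy_step(prefix, seed=i) for i in range(k)]
    rewards = [judge_reward(prefix, s) for s in steps]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(s, (r - mean) / std) for s, r in zip(steps, rewards)]

advantages = truncated_step_advantages(prefix=["q: who wrote Hamlet?"], k=4)
```

Because the k candidates share their entire prefix, the normalized advantages reflect only the quality of the single divergent step, which is the targeted credit assignment the abstract describes.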
PDF · March 9, 2026