

Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

February 26, 2026
Authors: Chris Samarinas, Haw-Shiuan Chang, Hamed Zamani
cs.AI

Abstract

Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
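The two ideas above can be combined in a short sketch: at every position, branch k candidate next steps from a shared trajectory prefix, score each with a step-level judge, and normalize rewards within the group to get per-step advantages. This is a minimal illustration under assumptions, not the paper's implementation: `policy_step` (samples one reasoning/search step given the prefix) and `judge_score` (the LLM-as-judge step reward) are hypothetical stand-ins, and the group-normalized advantage is a GRPO-style baseline chosen for concreteness.

```python
import statistics

def group_advantages(rewards):
    """Normalize a group of k rewards to zero mean / unit std
    (a GRPO-style baseline; returns zeros if all rewards tie)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def truncated_step_sampling(policy_step, judge_score, prompt, k, max_steps):
    """Branch k candidate *next steps* from a shared prefix at each
    position, score each with a dense step-level judge reward, and
    collect per-step (candidates, advantages) for the policy update.
    The shared prefix is then extended with the best-scoring step."""
    prefix = prompt
    updates = []
    for _ in range(max_steps):
        # The k samples share the prefix and diverge only at this step,
        # in contrast to sampling k full T-step trajectories per example.
        candidates = [policy_step(prefix) for _ in range(k)]
        rewards = [judge_score(prefix, c) for c in candidates]
        updates.append((candidates, group_advantages(rewards)))
        prefix = prefix + candidates[rewards.index(max(rewards))]
    return updates
```

Because each advantage is computed from k rewards for a single step rather than from whole-trajectory returns, the estimate avoids summing noise over all T steps, which is the intuition behind the claimed up-to-T variance reduction.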