Truncated Step-Level Sampling with Process Rewards for Retrieval-Augmented Reasoning

Abstract

Training large language models to reason with search engines via reinforcement learning is hindered by a fundamental credit assignment problem: existing methods such as Search-R1 provide only a sparse outcome reward after an entire multi-step trajectory, making it infeasible to attribute success or failure to individual reasoning and retrieval decisions. Process-reward methods like StepSearch alleviate this by introducing step-level supervision, but rely on heuristic rewards such as TF-IDF overlap with gold documents, and still sample k complete trajectories per example, retaining high gradient variance. We propose SLATE, a framework built on two complementary ideas: (1) truncated step-level sampling, which generates k trajectories that share a common prefix and differ only at the next step, and (2) dense LLM-as-judge rewards, which replace heuristic scoring with a capable LLM evaluator that assesses the quality of each reasoning step, search query, and answer, providing richer and more reliable supervision. We theoretically prove that under the same dense reward structure, truncated sampling reduces the variance of advantage estimates by up to a factor of T compared to full-trajectory sampling for T-step trajectories, yielding lower-variance, better-targeted policy gradients. Experiments on seven QA benchmarks confirm that SLATE consistently outperforms both sparse-reward and process-reward baselines, with the largest gains on harder multi-hop tasks and smaller models.
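The variance claim can be illustrated with a small Monte Carlo toy. This is a sketch under strong assumptions that are not taken from the paper: each step reward is an independent standard Gaussian (standing in for a judge score), advantages are computed by subtracting the group mean, and all function names are hypothetical. It is not SLATE's implementation or its proof; it only shows why scoring k candidates that differ at a single step can give lower-variance advantages than scoring k complete T-step trajectories.

```python
import random
import statistics


def full_trajectory_advantages(k, T, rng):
    """Sample k complete T-step trajectories and score each by its summed step rewards.

    Assumption: step rewards are i.i.d. N(0, 1), so each return has variance T.
    """
    returns = [sum(rng.gauss(0.0, 1.0) for _ in range(T)) for _ in range(k)]
    baseline = statistics.fmean(returns)  # group-mean baseline
    return [g - baseline for g in returns]


def truncated_step_advantages(k, rng):
    """Sample k candidates for the next step only.

    The shared prefix contributes the same amount to every candidate, so it
    cancels against the group-mean baseline and only the one-step reward remains.
    """
    step_rewards = [rng.gauss(0.0, 1.0) for _ in range(k)]
    baseline = statistics.fmean(step_rewards)
    return [r - baseline for r in step_rewards]


def empirical_variance(sampler, trials):
    """Pool advantages over many independent groups and return their variance."""
    values = []
    for _ in range(trials):
        values.extend(sampler())
    return statistics.pvariance(values)


def variance_ratio(k=8, T=5, trials=5000, seed=0):
    """Ratio of full-trajectory to truncated advantage variance in this toy model."""
    rng = random.Random(seed)
    full = empirical_variance(lambda: full_trajectory_advantages(k, T, rng), trials)
    trunc = empirical_variance(lambda: truncated_step_advantages(k, rng), trials)
    return full / trunc


if __name__ == "__main__":
    print(f"variance ratio (full / truncated): {variance_ratio():.2f}")
```

Under these independence assumptions the ratio should come out close to T, which matches the best case of the "up to a factor of T" statement. With correlated step rewards the reduction would generally be smaller.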
