PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Abstract

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contributed to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality by the posterior-to-prior probability ratio of the verified answer, and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a privileged, answer-conditioned teacher model and a standard student model. Decomposing this Bayesian evidence score autoregressively yields turn-level signals that indicate whether each intermediate turn supports or undermines the verified outcome. PBSD thus provides a principled reweighting scheme that turns sparse outcome supervision into Bayes-calibrated turn-level credit signals while remaining fully compatible with standard policy optimization. Experiments show that PBSD consistently improves performance in both in-domain and out-of-domain settings and transfers effectively from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism supports more effective policy learning and better generalization.
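The conversion described in the abstract rests on Bayes' rule: for a trajectory tau and verified answer a, p(a | tau) / p(a) = p(tau | a) / p(tau), so the answer-side ratio equals a likelihood ratio over the trajectory. Taking logs and factoring the trajectory turn by turn gives a sum of per-turn terms, each the teacher log-probability of that turn minus the student log-probability. The sketch below illustrates only this decomposition. It assumes per-token log-probabilities for the agent-generated tokens of each turn have already been computed under both models; the function names and input layout are hypothetical and not taken from the paper, and how PBSD turns these scores into policy-gradient weights is not shown.

```python
from typing import List, Sequence


def turn_evidence_scores(
    teacher_logps: Sequence[Sequence[float]],
    student_logps: Sequence[Sequence[float]],
) -> List[float]:
    """Per-turn log-likelihood ratio (hypothetical helper, not the paper's API).

    teacher_logps[k][i]: log-prob of token i in turn k under the
        answer-conditioned teacher (assumed precomputed).
    student_logps[k][i]: log-prob of the same token under the student,
        which does not see the answer (assumed precomputed).
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("teacher and student must cover the same turns")

    scores: List[float] = []
    for t_turn, s_turn in zip(teacher_logps, student_logps):
        if len(t_turn) != len(s_turn):
            raise ValueError("each turn must have matching token counts")
        # Positive score: the turn is more likely once the answer is known,
        # i.e. it counts as evidence for the verified outcome.
        scores.append(sum(t - s for t, s in zip(t_turn, s_turn)))
    return scores


def trajectory_evidence(turn_scores: Sequence[float]) -> float:
    """Trajectory-level log ratio: the sum of the per-turn terms."""
    return sum(turn_scores)


# Toy example with two turns; the numbers are made up for illustration.
teacher = [[-1.0, -0.5], [-2.0]]
student = [[-1.5, -1.0], [-1.0]]
scores = turn_evidence_scores(teacher, student)
total = trajectory_evidence(scores)
```

In this toy example the first turn receives a positive score (it supports the answer) and the second a negative one (it undermines it), even though the trajectory-level total is zero, which is the kind of distinction a single outcome reward cannot express.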