PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Abstract

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but give little guidance on which intermediate reasoning steps or tool interactions contributed to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality by the posterior-to-prior probability ratio of the verified answer, and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a privileged, answer-conditioned teacher model and a standard student model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that indicate whether each intermediate turn supports or undermines the verified outcome. PBSD thus provides a principled reweighting scheme that turns sparse outcome supervision into Bayes-calibrated turn-level credit signals while remaining fully compatible with standard policy optimization. Experiments show that PBSD consistently improves performance in both in-domain and out-of-domain settings and transfers effectively from short-context training to long-context inference, suggesting that fine-grained credit assignment leads to more effective policy learning and better generalization.
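To make the Bayes step concrete, here is a minimal sketch under assumed notation that does not appear in the abstract: q is the question, a the verified answer, and tau the trajectory. Bayes' rule gives p(a | q, tau) / p(a | q) = p(tau | q, a) / p(tau | q), so the log of the answer-side ratio equals log p_teacher(tau | q, a) - log p_student(tau | q). Because both models are autoregressive, this log-ratio is a sum of per-token log-ratios, and grouping tokens by turn gives one evidence score per turn. The function names, the toy numbers, and the choice to score only model-generated tokens are illustrative assumptions; how PBSD maps these scores to training weights is not specified in the abstract and is not shown.

```python
from typing import List, Sequence


def turn_evidence_scores(
    teacher_logps: Sequence[float],
    student_logps: Sequence[float],
    turn_ids: Sequence[int],
) -> List[float]:
    """Sum per-token log-likelihood ratios (teacher minus student) per turn.

    teacher_logps[i]: log-prob of token i under the answer-conditioned teacher.
    student_logps[i]: log-prob of the same token under the standard student.
    turn_ids[i]:      0-based index of the turn that token i belongs to.
    All three inputs are hypothetical, precomputed quantities.
    """
    if not (len(teacher_logps) == len(student_logps) == len(turn_ids)):
        raise ValueError("all inputs must have the same length")
    num_turns = max(turn_ids) + 1 if len(turn_ids) > 0 else 0
    scores = [0.0] * num_turns
    for lp_teacher, lp_student, turn in zip(teacher_logps, student_logps, turn_ids):
        scores[turn] += lp_teacher - lp_student
    return scores


def trajectory_evidence(turn_scores: Sequence[float]) -> float:
    """Trajectory-level log evidence: the sum of the turn-level scores."""
    return sum(turn_scores)


# Toy example (made-up numbers): four generated tokens split over two turns.
teacher = [-1.0, -0.5, -2.0, -1.5]
student = [-1.5, -1.0, -1.0, -1.0]
turns = [0, 0, 1, 1]

scores = turn_evidence_scores(teacher, student, turns)
total = trajectory_evidence(scores)
for index, score in enumerate(scores):
    label = "supports" if score > 0 else "undermines"
    print(f"turn {index}: {score:+.2f} ({label} the verified answer)")
print(f"trajectory log evidence: {total:+.2f}")
```

In this toy case the first turn has a positive score (knowing the answer makes it more likely) and the second a negative one, which is the sign pattern the abstract describes as supporting or undermining the verified outcome.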