PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Abstract

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contributed to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer, and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged, answer-conditioned teacher model. An autoregressive decomposition of this Bayesian evidence score yields turn-level signals that indicate whether each intermediate turn supports or undermines the verified outcome. PBSD thus provides a principled reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals while remaining fully compatible with standard policy optimization. Experiments show that PBSD consistently improves performance in both in-domain and out-of-domain settings and transfers effectively from short-context training to long-context inference. These results suggest that its fine-grained credit assignment mechanism leads to more effective policy learning and better generalization.
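
The chain from the answer-side ratio to turn-level signals can be written out as a short derivation. This is only a sketch based on the description above, not the paper's exact formulation. All notation is assumed for illustration: $q$ is the question, $a^\star$ the verified answer, $\tau = (y_1, \dots, y_T)$ the agent's turns, $h_{<t}$ the history before turn $t$ (earlier turns and tool observations), and $\pi_\theta$ the model, used both as the student (without $a^\star$) and as the privileged teacher (conditioned on $a^\star$). The second line further assumes that only the agent's own turns are scored, with tool observations treated as context.

```latex
\begin{align}
S(\tau)
  &= \log \frac{p(a^\star \mid q, \tau)}{p(a^\star \mid q)}
   = \log \frac{p(\tau \mid q, a^\star)}{p(\tau \mid q)}
   && \text{(Bayes' rule)} \\
  &\approx \sum_{t=1}^{T}
     \underbrace{\log \frac{\pi_\theta(y_t \mid q, a^\star, h_{<t})}
                           {\pi_\theta(y_t \mid q, h_{<t})}}_{s_t}
   && \text{(autoregressive factorization over turns)}
\end{align}
```

Under these assumptions, $s_t > 0$ means turn $t$ becomes more likely once the verified answer is known, so it counts as evidence for the outcome, while $s_t < 0$ marks a turn that points away from it. How the scores $s_t$ are mapped to weights in the policy objective is not specified here and is left out of the sketch.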