OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Abstract

Outcome-based reinforcement learning provides a stable optimization backbone for language agents, but its sparse trajectory-level rewards give little guidance on which intermediate decisions should be reinforced or suppressed. On-policy self-distillation offers dense token-level supervision, yet existing skill-conditioned variants often rely on external skill memories or retrieved privileged context, which are costly to maintain and can be mismatched with the state distribution induced by the current policy in multi-turn interaction. We propose OPID (On-Policy Skill Distillation), a framework that extracts skill supervision directly from completed on-policy trajectories. OPID represents trajectory hindsight as hierarchical skills: episode-level skills capture global workflows or failure-avoidance rules, while step-level skills capture local decision knowledge at critical timesteps. A critical-first routing mechanism uses step-level skills when critical decisions are identified and otherwise falls back to episode-level skills as default guidance. The selected skill is injected into the interaction history, allowing the old policy to re-score the same sampled response under both the original and the skill-augmented contexts. The resulting log-probability shift yields a token-level self-distillation advantage, which is combined with the outcome advantage for policy optimization. OPID thus preserves RL as the primary training objective while introducing dense, distribution-matched hindsight supervision. Experiments on ALFWorld, WebShop, and Search-based QA demonstrate that OPID generally improves agent performance, sample efficiency, and robustness over outcome-only RL and existing skill-distillation baselines. Our code is available at https://github.com/jinyangwu/OPID/tree/main.
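
The routing and advantage steps described above can be illustrated with a minimal Python sketch. It is not taken from the OPID repository. The function names, the `beta` weight, the sign convention (skill-augmented minus original log-probability), and the additive combination with the outcome advantage are all assumptions made for illustration; the abstract states only that the log-probability shift yields a token-level advantage that is combined with the outcome advantage.

```python
from typing import Dict, List


def route_skill(step_skills: Dict[int, str], episode_skill: str, t: int) -> str:
    """Critical-first routing (hypothetical helper).

    If timestep t was flagged as critical and has a step-level skill,
    use it; otherwise fall back to the episode-level skill.
    """
    return step_skills.get(t, episode_skill)


def combined_advantages(
    logp_orig: List[float],
    logp_skill: List[float],
    outcome_adv: float,
    beta: float = 0.5,
) -> List[float]:
    """Per-token advantage for one sampled response (hypothetical helper).

    logp_orig[i]  : old-policy log-prob of token i under the original context
    logp_skill[i] : old-policy log-prob of the same token with the skill injected
    outcome_adv   : trajectory-level outcome advantage, broadcast to every token
    beta          : assumed weight on the self-distillation term
    """
    if len(logp_orig) != len(logp_skill):
        raise ValueError("both contexts must score the same sampled tokens")
    # Assumed form: outcome advantage plus a weighted log-probability shift.
    return [
        outcome_adv + beta * (ls - lo)
        for lo, ls in zip(logp_orig, logp_skill)
    ]
```

Under these assumptions, a token that the skill-augmented context makes more likely receives a larger advantage than the outcome signal alone would give, and a token it makes less likely receives a smaller one, so the outcome term remains the base signal while the shift adds per-token credit.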