HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only to the selected action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while reducing the time per training step by 2.26×, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
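
The abstract does not spell out how distillation is restricted to the targeted spans. One plausible reading, offered here only as a minimal sketch, is a token-level mask over a per-token distillation loss. Every name below (`span_mask`, `masked_mean_loss`, `turn_spans`, `selected_turns`) and every number is a hypothetical stand-in for illustration, not part of HINT-SD as published.

```python
from typing import List, Sequence, Set, Tuple


def span_mask(
    num_tokens: int,
    turn_spans: Sequence[Tuple[int, int]],
    selected_turns: Set[int],
) -> List[int]:
    """Return a 0/1 mask that is 1 only inside the selected turns.

    turn_spans[i] is the half-open token range (start, end) of turn i.
    selected_turns holds the turn indices flagged as failure-relevant
    (assumed to come from some hindsight step that is not modeled here).
    """
    mask = [0] * num_tokens
    for turn_idx, (start, end) in enumerate(turn_spans):
        if turn_idx in selected_turns:
            for t in range(start, end):
                mask[t] = 1
    return mask


def masked_mean_loss(token_losses: Sequence[float], mask: Sequence[int]) -> float:
    """Average the per-token losses over masked-in positions only."""
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    # With nothing selected there is nothing to distill, so return 0.0.
    return sum(kept) / len(kept) if kept else 0.0


# Toy trajectory: three turns of two action tokens each. The per-token
# values stand in for a distillation loss (for example a teacher-student
# divergence); they are made up for this example.
token_losses = [0.5, 0.5, 2.0, 4.0, 0.25, 0.25]
turn_spans = [(0, 2), (2, 4), (4, 6)]
selected_turns = {1}  # assume hindsight flagged only the middle turn

mask = span_mask(len(token_losses), turn_spans, selected_turns)
targeted = masked_mean_loss(token_losses, mask)
dense = masked_mean_loss(token_losses, [1] * len(token_losses))

print(mask)      # [0, 0, 1, 1, 0, 0]
print(targeted)  # 3.0
print(dense)     # 1.25
```

In this toy setup the targeted loss concentrates entirely on the flagged turn, whereas the dense variant averages over every turn, including those that presumably need no correction.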