HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeded, but not which intermediate actions caused the outcome or how those actions should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only to the targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while reducing the time per training step by a factor of 2.26, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
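
The idea of distilling only on targeted action spans can be illustrated with a minimal sketch. The abstract does not specify the objective, so everything below is an assumption made for illustration: the half-open token-range representation of a span, the use of a forward KL divergence between a feedback-conditioned teacher and an unconditioned student, and the hypothetical function names. The hindsight step that decides which spans are failure-relevant is taken as a given input here.

```python
import math
from typing import List, Sequence, Tuple

# Assumption: an action span is a half-open token range [start, end).
Span = Tuple[int, int]
Dist = Sequence[float]


def span_mask(num_tokens: int, targeted_spans: Sequence[Span]) -> List[bool]:
    """Mark the tokens that fall inside any targeted action span."""
    mask = [False] * num_tokens
    for start, end in targeted_spans:
        for i in range(start, end):
            mask[i] = True
    return mask


def kl(p: Dist, q: Dist) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes every q[i] is strictly positive wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def targeted_distillation_loss(
    teacher: Sequence[Dist],
    student: Sequence[Dist],
    targeted_spans: Sequence[Span],
) -> float:
    """Average per-token KL(teacher || student) over targeted tokens only.

    Tokens outside the targeted spans contribute nothing to the loss.
    Returns 0.0 when no token is targeted.
    """
    mask = span_mask(len(student), targeted_spans)
    selected = [kl(t, s) for t, s, m in zip(teacher, student, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


# Toy trajectory: 4 tokens over a 2-symbol vocabulary.
# The teacher (conditioned on feedback) disagrees with the student
# only on tokens 2 and 3, which stand in for a failed action.
teacher_probs = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
student_probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

loss_on_failure = targeted_distillation_loss(teacher_probs, student_probs, [(2, 4)])
loss_on_neutral = targeted_distillation_loss(teacher_probs, student_probs, [(0, 2)])
```

In this toy setup, targeting the neutral span yields zero loss while targeting the failed span yields a positive one, which mirrors the abstract's argument that supervision placed on already-correct turns adds cost without adding signal.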