HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on the targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while reducing the time per training step by 2.26×, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
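
The idea of distilling only on targeted action spans can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice of KL(teacher || student) as the per-token loss, and the half-open token-span representation are all assumptions made for illustration. Here "teacher" stands for the model's own feedback-conditioned distribution and "student" for its distribution without feedback.

```python
import math
from typing import List, Sequence, Tuple

Distribution = Sequence[float]


def span_mask(num_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[int]:
    """Return 1 for token positions inside any [start, end) span, else 0."""
    mask = [0] * num_tokens
    for start, end in spans:
        for i in range(max(0, start), min(num_tokens, end)):
            mask[i] = 1
    return mask


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def targeted_distillation_loss(
    teacher_probs: Sequence[Distribution],
    student_probs: Sequence[Distribution],
    target_spans: Sequence[Tuple[int, int]],
) -> float:
    """Mean per-token KL(teacher || student), restricted to the target spans.

    Hypothetical helper: tokens outside the selected spans contribute nothing,
    so turns that were already fine receive no distillation signal.
    """
    mask = span_mask(len(student_probs), target_spans)
    selected = sum(mask)
    if selected == 0:
        return 0.0  # nothing was selected, so there is nothing to distill
    total = sum(
        m * kl_divergence(t, s)
        for m, t, s in zip(mask, teacher_probs, student_probs)
    )
    return total / selected


# Toy trajectory of three tokens over a two-word vocabulary.
# Only the middle token differs once feedback is available.
teacher = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
loss_targeted = targeted_distillation_loss(teacher, student, [(1, 2)])
loss_dense = targeted_distillation_loss(teacher, student, [(0, 3)])
```

In this toy case the dense loss averages the same signal over all three tokens, while the targeted loss concentrates it on the single span where the feedback-conditioned distribution actually disagrees with the student. How spans are selected from the full trajectory is the part the abstract attributes to hindsight and is not modeled here.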