HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only to the selected action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while reducing time per training step by a factor of 2.26, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
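As a rough illustration of the idea of distilling only on targeted action spans, the sketch below restricts a per-token loss to a set of selected spans. It is not the authors' implementation: the function names `span_mask` and `masked_mean_loss`, the half-open `(start, end)` span format, and the choice to average over selected tokens are all assumptions made for this example, and the per-token loss values are placeholders rather than real model outputs.

```python
from typing import List, Tuple


def span_mask(num_tokens: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Build a 0/1 mask with 1s inside each half-open [start, end) span.

    Hypothetical helper: spans stand in for the failure-relevant action
    spans that a hindsight selection step would produce.
    """
    mask = [0] * num_tokens
    for start, end in spans:
        # Clamp to the valid token range so out-of-bounds spans are ignored.
        for i in range(max(start, 0), min(end, num_tokens)):
            mask[i] = 1
    return mask


def masked_mean_loss(per_token_loss: List[float], mask: List[int]) -> float:
    """Average the per-token loss over masked-in tokens only.

    Returns 0.0 when no token is selected, so a trajectory with no
    targeted span contributes nothing to the distillation objective.
    """
    selected = sum(mask)
    if selected == 0:
        return 0.0
    total = sum(loss * m for loss, m in zip(per_token_loss, mask))
    return total / selected


# Placeholder per-token distillation losses for a 6-token trajectory.
losses = [1.0, 2.0, 4.0, 8.0, 6.0, 3.0]
# Assume hindsight selection flagged tokens 1-2 and token 4.
mask = span_mask(len(losses), [(1, 3), (4, 5)])
targeted_loss = masked_mean_loss(losses, mask)
```

In this sketch, tokens outside the selected spans receive no distillation signal at all, which is the contrast with dense per-turn feedback that the abstract draws.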