RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Abstract

Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy, which balances relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results (e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper) while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.
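The abstract names the two feedback signals but gives no formulas, so the following is only a rough sketch of how they might look, not the paper's method. The additive score, the weights `alpha` and `c`, the `Lesson` fields, and all function names are hypothetical; the exploration term is the standard UCB1-style bonus, swapped in as a stand-in for whatever SimUtil-UCB actually uses.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Lesson:
    """A stored lesson; all fields are assumed bookkeeping, not from the paper."""
    text: str
    embedding: List[float]
    utility: float = 0.0  # running mean of outcomes after this lesson was used
    uses: int = 0         # how many times this lesson has been retrieved


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def score(lesson, query, total_uses, alpha=0.5, c=1.0):
    """Assumed score: weighted relevance and utility plus a UCB1-style bonus."""
    relevance = cosine(lesson.embedding, query)
    bonus = c * math.sqrt(math.log(total_uses + 1) / (lesson.uses + 1))
    return alpha * relevance + (1 - alpha) * lesson.utility + bonus


def retrieve(memory, query, k=1, alpha=0.5, c=1.0):
    """Return the k highest-scoring lessons and count them as used."""
    total = sum(lesson.uses for lesson in memory)
    ranked = sorted(memory, key=lambda l: score(l, query, total, alpha, c),
                    reverse=True)
    chosen = ranked[:k]
    for lesson in chosen:
        lesson.uses += 1
    return chosen


def update_utility(lesson, outcome):
    """Fold an episode outcome into the lesson's running mean utility."""
    lesson.utility += (outcome - lesson.utility) / max(lesson.uses, 1)


def progress_reward(completed_now, best_before):
    """Assumed numerical feedback: reward only gains over the best prior attempt."""
    return max(0.0, completed_now - best_before)
```

In this sketch, a rarely retrieved lesson receives a larger bonus, so it is eventually tried even when its similarity to the current task is low; calling `update_utility` after each episode then lets lessons that actually helped rank higher on later retrievals.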
