RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Abstract

Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, from which they are retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy, which balances relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods and achieves state-of-the-art results. For example, it surpasses Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper. It also exhibits strong test-time adaptation and generalization to out-of-distribution scenarios.
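
The abstract names the three ingredients of SimUtil-UCB retrieval (relevance, utility, and exploration) but does not give its formula. The following is only a minimal sketch of how a generic UCB-style score could combine those three terms; it is not the authors' implementation. The `Lesson` fields, the additive form of the score, and the weights `alpha`, `beta`, and `explore` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Lesson:
    """A stored lesson in a hypothetical memory buffer (fields are assumptions)."""
    text: str
    similarity: float         # relevance to the current task, assumed precomputed in [0, 1]
    utility_sum: float = 0.0  # cumulative usefulness observed after past retrievals
    uses: int = 0             # how many times this lesson has been retrieved


def ucb_score(lesson, total_retrievals, alpha=1.0, beta=1.0, explore=1.0):
    """Generic UCB-style score: relevance + mean utility + exploration bonus.

    The additive form and the weights are illustrative, not taken from the paper.
    """
    mean_utility = lesson.utility_sum / lesson.uses if lesson.uses else 0.0
    # The bonus shrinks as a lesson is retrieved more often (classic UCB shape).
    bonus = explore * math.sqrt(math.log(total_retrievals + 1) / (lesson.uses + 1))
    return alpha * lesson.similarity + beta * mean_utility + bonus


def retrieve(buffer, k=1, **weights):
    """Return the k highest-scoring lessons from the buffer."""
    total = sum(lesson.uses for lesson in buffer)
    ranked = sorted(buffer, key=lambda l: ucb_score(l, total, **weights), reverse=True)
    return ranked[:k]
```

In this sketch, a lesson that has never been retrieved receives the largest exploration bonus, so it can outrank a more similar or more useful lesson until it has been tried a few times. Setting `explore=0.0` reduces the ranking to relevance plus mean utility.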
