Hindsight Credit Assignment for Long-Horizon LLM Agents

Abstract

Large Language Model (LLM) agents often face a severe credit assignment problem in long-horizon, multi-step tasks, because rewards are sparse. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO uses the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO's multi-scale advantage mechanism supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, show that HCAPO consistently outperforms state-of-the-art RL methods. Notably, with the Qwen2.5-7B-Instruct model, HCAPO improves success rate over GRPO by 7.7% on WebShop and by 13.8% on ALFWorld. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and ensures scalability in complex, long-horizon tasks.
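
To make the first bottleneck concrete, the following is a minimal sketch of the group-relative advantage commonly used in GRPO-style training. It is not HCAPO's method. It assumes a sparse 0/1 success reward per rollout and normalization by the population standard deviation; the function names and the example numbers are hypothetical. The point it illustrates is that every step of a trajectory receives the same advantage, so a decisive action and an irrelevant one are credited identically.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalize each trajectory's return against its sampling group.

    Assumption: population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


def broadcast_to_steps(advantages, step_counts):
    """Assign each trajectory's single advantage to all of its steps."""
    return [[a] * n for a, n in zip(advantages, step_counts)]


# Four hypothetical rollouts of one task with a sparse success reward.
returns = [1.0, 0.0, 0.0, 1.0]
step_counts = [3, 5, 4, 2]

advantages = group_relative_advantages(returns)
per_step = broadcast_to_steps(advantages, step_counts)

for i, steps in enumerate(per_step):
    print(f"rollout {i}: {len(steps)} steps, advantage per step = {steps[0]:+.3f}")
```

Because the step-level signal is only a copy of the trajectory-level one, it cannot distinguish which intermediate actions mattered. This is the gap that hindsight refinement of step-level Q-values is meant to address.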
