
Hindsight Credit Assignment for Long-Horizon LLM Agents

March 7, 2026
作者: Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, Yu-Feng Li
cs.AI

Abstract

Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO's multi-scale advantage mechanism effectively supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state-of-the-art RL methods. Notably, with the Qwen2.5-7B-Instruct model, HCAPO improves the success rate over GRPO by 7.7% on WebShop and by 13.8% on ALFWorld. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and scales to complex, long-horizon tasks.
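To make the credit-assignment bottleneck concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm: a GRPO-style group-relative advantage assigns every step of a rollout the same trajectory-level advantage, while a hypothetical hindsight critic could reweight that advantage across steps according to post-hoc importance scores. The function names and the proportional-reweighting scheme are assumptions for illustration only.

```python
import statistics

def grpo_advantages(returns):
    """GRPO-style group-relative advantage: each rollout's return is
    normalized against the group's mean and standard deviation, so no
    learned value function is required. Every step in a rollout inherits
    the same trajectory-level advantage, which is the credit-assignment
    bottleneck in long-horizon tasks with sparse terminal rewards."""
    mu = statistics.mean(returns)
    sigma = statistics.stdev(returns) if len(returns) > 1 else 1.0
    return [(r - mu) / (sigma + 1e-8) for r in returns]

def hindsight_step_advantages(traj_adv, step_scores):
    """Hypothetical step-level reweighting: a post-hoc critic rates each
    step's contribution in [0, 1], and the trajectory advantage is spread
    across steps in proportion to those ratings instead of uniformly.
    (Illustrative only; HCAPO's exact formulation may differ.)"""
    total = sum(step_scores) or 1.0
    n = len(step_scores)
    return [traj_adv * n * s / total for s in step_scores]

# Four rollouts of the same task with sparse 0/1 terminal rewards.
group_returns = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(group_returns)

# The critic judges step 2 of the first (successful) rollout most decisive,
# so that step receives most of the positive advantage.
step_adv = hindsight_step_advantages(adv[0], [0.1, 0.8, 0.1])
```

Under this sketch, uniform per-step credit would give every step of the successful rollout the same advantage; the hindsight weights instead concentrate the learning signal on the step the critic deems decisive, while preserving the rollout's average advantage.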