Hindsight Credit Assignment for Long-Horizon LLM Agents

Abstract

Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks because rewards are sparse. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO uses the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. In addition, HCAPO's multi-scale advantage mechanism supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, show that HCAPO consistently outperforms state-of-the-art RL methods. Notably, with the Qwen2.5-7B-Instruct model, HCAPO improves success rate over GRPO by 7.7% on WebShop and by 13.8% on ALFWorld. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and scales to complex, long-horizon tasks.
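
The credit assignment bottleneck described above can be illustrated with a minimal sketch of trajectory-level, group-relative advantages in the style of GRPO. This is not HCAPO's implementation: the function names, the example rewards and step counts, and the use of the population standard deviation are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(returns)
    sigma = pstdev(returns)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in returns]


def broadcast_to_steps(advantages, step_counts):
    """Assign every step of a rollout that rollout's single advantage."""
    return [[a] * n for a, n in zip(advantages, step_counts)]


# Hypothetical group of four rollouts for one task: 1.0 = success, 0.0 = failure.
returns = [1.0, 0.0, 0.0, 1.0]
step_counts = [3, 5, 4, 2]  # hypothetical number of agent steps per rollout

advantages = group_relative_advantages(returns)
per_step = broadcast_to_steps(advantages, step_counts)
```

Every step in a rollout receives the same value here, so a sparse outcome reward alone cannot separate a decisive action from an irrelevant one. That is the gap step-level Q-value refinement is meant to close.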
