StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Abstract

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult. Current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
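
The abstract does not spell out how the hierarchical GRPO-style rollout assigns credit, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that several strategies are sampled per task, that several action rollouts are run under each strategy, that a strategy is scored by the mean return of its rollouts, and that advantages are normalized within each group. The function names, the reward layout, and the `eps` constant are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-8):
    """Group-relative advantage: center on the group mean, scale by the group std."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def hierarchical_advantages(rewards):
    """Two-level advantages for one task (assumed layout, not from the paper).

    rewards[k][m] is the return of rollout m executed under strategy k.
    """
    # Strategy level (assumption): score each strategy by the mean return of
    # its rollouts, then compare strategies against each other.
    strategy_scores = [mean(group) for group in rewards]
    strategy_adv = group_relative(strategy_scores)

    # Action level (assumption): compare each rollout only with the other
    # rollouts that were conditioned on the same strategy.
    action_adv = [group_relative(group) for group in rewards]
    return strategy_adv, action_adv


# Toy example: 3 strategies, 2 rollouts each, binary success rewards.
rewards = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
strategy_adv, action_adv = hierarchical_advantages(rewards)
```

In this toy setup the strategy whose rollouts all succeed receives a positive strategy-level advantage and the one whose rollouts all fail receives a negative one, while rollouts within a uniformly successful or uniformly failing group get zero action-level advantage, since nothing distinguishes them from each other.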
