
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

May 7, 2026
作者: Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin
cs.AI

Abstract

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
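The abstract describes a two-level scheme: sample a trajectory-level strategy from the initial state, condition every action on it, and score strategies with group-relative (GRPO-style) advantages. A minimal sketch of that control flow is below; the paper does not give implementation details, so all function names, the placeholder rewards, and the fixed horizon are illustrative assumptions, not the authors' code.

```python
def sample_strategies(initial_state, num_candidates=4):
    """Hypothetical: draw several candidate trajectory-level strategies
    from the initial task state ("diverse strategy rollout")."""
    return [f"strategy-{i} for {initial_state}" for i in range(num_candidates)]

def rollout(strategy, initial_state, horizon=3):
    """Hypothetical: execute a trajectory in which every action is
    conditioned on the same fixed strategy."""
    trajectory, state = [], initial_state
    for t in range(horizon):
        action = f"act(t={t} | {strategy})"
        trajectory.append((state, action))
        state = f"state-after:{action}"
    return trajectory

def hierarchical_grpo_step(initial_state):
    """Group rollouts by strategy and compute group-relative advantages,
    so credit flows both to strategy generation (upper level) and to
    action execution (lower level). Rewards here are placeholders."""
    strategies = sample_strategies(initial_state)
    rewards = {}
    for i, s in enumerate(strategies):
        _trajectory = rollout(s, initial_state)
        rewards[s] = float(i)  # placeholder terminal reward per strategy
    baseline = sum(rewards.values()) / len(rewards)
    return {s: r - baseline for s, r in rewards.items()}
```

In a real training loop these advantages would weight policy-gradient updates for both the strategy generator and the action policy; here they only illustrate the group-relative baseline that GRPO-style methods use in place of a learned critic.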