Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning
January 28, 2026
Authors: Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
cs.AI
Abstract
Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and allocate computational resources indiscriminately across intermediate steps. This wastes a substantial share of the computation budget on trivial steps and still fails to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples and generalizes robustly even to unseen scenarios.
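The idea of branching only at critical decision states can be illustrated with a toy sketch. The abstract does not say how Spark detects critical states or how many branches it opens, so everything below is an assumption made for illustration: the entropy of the policy's action distribution stands in for the "intrinsic decision-making signal", and the names `branch_factor`, `rollout_with_branching`, `threshold`, `max_branches`, and `budget` are hypothetical. It is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_factor(probs: Sequence[float],
                  threshold: float = 0.5,
                  max_branches: int = 3) -> int:
    """Hypothetical rule: open extra branches only when the policy is uncertain."""
    return max_branches if entropy(probs) > threshold else 1


def rollout_with_branching(policy: Callable[[int], Sequence[float]],
                           transition: Callable[[int, int], int],
                           start: int,
                           horizon: int,
                           budget: int,
                           threshold: float = 0.5,
                           max_branches: int = 3) -> List[List[int]]:
    """Roll out for `horizon` steps, keeping at most `budget` (>= 1) trajectories.

    Confident states continue with a single action; uncertain states fan out
    into the top-k most probable actions (a simplification of sampling).
    """
    frontier: List[Tuple[int, List[int]]] = [(start, [])]
    for _ in range(horizon):
        active = len(frontier)  # trajectories alive at this step
        next_frontier: List[Tuple[int, List[int]]] = []
        for state, actions in frontier:
            probs = policy(state)
            # Extra branches are limited by the spare budget and the action count.
            k = min(branch_factor(probs, threshold, max_branches),
                    1 + budget - active,
                    len(probs))
            active += k - 1
            ranked = sorted(range(len(probs)), key=lambda a: probs[a], reverse=True)
            for action in ranked[:k]:
                next_frontier.append((transition(state, action), actions + [action]))
        frontier = next_frontier
    return [actions for _, actions in frontier]


def toy_policy(state: int) -> Sequence[float]:
    """Toy policy: uncertain only at state 1, confident elsewhere."""
    return [1 / 3, 1 / 3, 1 / 3] if state == 1 else [0.98, 0.01, 0.01]


def toy_transition(state: int, action: int) -> int:
    """Toy environment: every action advances to the next state."""
    return state + 1


if __name__ == "__main__":
    for trajectory in rollout_with_branching(toy_policy, toy_transition, 0, 3, 8):
        print(trajectory)
```

In this toy setting only the uncertain middle step fans out, so the rollout budget is spent where the policy's choice is actually in doubt, while the two confident steps each cost a single continuation.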