Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

January 28, 2026
Authors: Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao
cs.AI

Abstract

Reinforcement learning has enabled large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging because high-quality trajectories are scarce, especially under limited resources. Existing methods typically scale up rollout sizes and allocate computation indiscriminately across intermediate steps. This approach wastes a substantial share of the computation budget on trivial steps and still does not guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves higher success rates with significantly fewer training samples and generalizes robustly even to unseen scenarios.
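
The abstract does not say which intrinsic signal marks a critical decision state, or how branches are divided among such states. As a rough illustration of the general idea only, the sketch below assumes the signal is the entropy of the policy's action distribution and splits a fixed branching budget evenly across steps whose entropy exceeds a threshold. The function names, the threshold, and the even split are all hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def allocate_branches(step_probs, budget, threshold):
    """Split an extra-rollout budget across high-entropy steps.

    Hypothetical helper: using entropy as the "critical state" signal and
    dividing the budget evenly are assumptions, not the paper's stated rule.
    """
    scores = [entropy(p) for p in step_probs]
    key_steps = [i for i, s in enumerate(scores) if s > threshold]
    branches = [0] * len(step_probs)
    if not key_steps:
        return branches  # no uncertain step, so no branching at all
    share, extra = divmod(budget, len(key_steps))
    for rank, i in enumerate(key_steps):
        # Earlier key steps absorb any remainder of the budget.
        branches[i] = share + (1 if rank < extra else 0)
    return branches


# Toy trajectory: steps 0 and 2 are near-deterministic, steps 1 and 3 are uncertain.
step_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.40, 0.30, 0.20, 0.10],
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],
]
branches = allocate_branches(step_probs, budget=8, threshold=1.0)
print(branches)  # [0, 4, 0, 4]
```

Under these assumptions the whole budget goes to the two uncertain steps, whereas a uniform scheme would spend half of it on steps where the policy is already nearly certain.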