Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
December 23, 2025
Authors: Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
cs.AI
Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and is accompanied by a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
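To make the described architecture more concrete, below is a minimal PyTorch-style sketch of how a higher-order controller could steer the residual stream of a base autoregressive model while also emitting a learned termination signal. This is an illustrative sketch under assumptions, not the paper's implementation: the class names (`ControlledBlock`, `InternalController`), the additive intervention on the residual stream, and the sigmoid termination head are all hypothetical choices made for clarity.

```python
import torch
import torch.nn as nn


class ControlledBlock(nn.Module):
    """One transformer block of a base model whose residual stream can be
    steered by an externally supplied control vector (additive here;
    the paper's exact conditioning mechanism may differ)."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # Inject the latent action into the residual stream before the block.
        x = x + control
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x


class InternalController(nn.Module):
    """Hypothetical higher-order module mapping a context summary to
    (i) a latent control vector held fixed over many base-model steps
    (a temporally abstract action) and (ii) a termination probability
    deciding when to hand control back."""

    def __init__(self, d_model: int):
        super().__init__()
        self.to_control = nn.Linear(d_model, d_model)
        self.to_halt = nn.Linear(d_model, 1)

    def forward(self, summary: torch.Tensor):
        control = self.to_control(summary)              # latent action
        p_halt = torch.sigmoid(self.to_halt(summary))   # learned termination condition
        return control, p_halt


# Usage sketch: one controller steers the block for several steps,
# until its termination probability says to stop.
if __name__ == "__main__":
    d_model, batch, seq = 64, 2, 8
    block = ControlledBlock(d_model)
    controller = InternalController(d_model)

    x = torch.randn(batch, seq, d_model)
    summary = x.mean(dim=1)                             # crude context summary
    control, p_halt = controller(summary)

    for _ in range(5):                                  # held fixed over multiple steps
        x = block(x, control.unsqueeze(1))
        if bool((p_halt > 0.5).all()):
            break                                       # hand control back
```

In an internal-RL setting as described in the abstract, the reinforcement signal would be applied to the controller's latent outputs rather than to individual sampled tokens; the threshold-based halting above is only one simple way such a termination condition could be used.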