Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
December 23, 2025
Authors: Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherrer, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento
cs.AI
Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
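The abstract describes a higher-order model whose outputs steer the residual stream of a base autoregressive model and carry a learned termination condition. The sketch below is a minimal, illustrative reading of that setup, not the authors' implementation: all module names, shapes, and the additive form of the steering signal are assumptions made for the example.

```python
# Minimal sketch (assumed architecture, not the paper's code): a higher-order
# controller emits a latent control vector that is added to the residual
# stream of a base autoregressive block over a chunk of timesteps, together
# with a termination probability for that chunk.
import torch
import torch.nn as nn


class BaseBlock(nn.Module):
    """One transformer-style block of the base autoregressive model."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, control=None):
        # Standard residual updates of the base model ...
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        # ... plus an additive steering signal on the residual stream
        # (one assumed way the controller could act on activations).
        if control is not None:
            x = x + control
        return x


class HigherOrderController(nn.Module):
    """Maps a context summary to (i) a control vector held fixed over a chunk
    of timesteps and (ii) a termination probability for that controller."""
    def __init__(self, d_model: int, d_ctrl: int = 64):
        super().__init__()
        self.encode = nn.GRU(d_model, d_ctrl, batch_first=True)
        self.to_control = nn.Linear(d_ctrl, d_model)
        self.to_halt = nn.Linear(d_ctrl, 1)

    def forward(self, context):                      # context: (B, T, d_model)
        _, h = self.encode(context)                  # h: (1, B, d_ctrl)
        control = self.to_control(h[-1])             # (B, d_model)
        halt_prob = torch.sigmoid(self.to_halt(h[-1]))  # learned termination
        return control, halt_prob


if __name__ == "__main__":
    B, T, d_model = 2, 16, 128
    tokens = torch.randn(B, T, d_model)              # stand-in for embedded tokens
    block = BaseBlock(d_model)
    controller = HigherOrderController(d_model)

    control, halt_prob = controller(tokens)
    out = block(tokens, control=control.unsqueeze(1))  # steer the whole chunk
    print(out.shape, halt_prob.squeeze(-1))
```

Under this reading, "internal RL" would correspond to treating the controller's latent output (rather than individual tokens) as the action that is sampled and reinforced, with the halt probability deciding when one temporally abstract action ends and the next begins.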