Interpreting Emergent Planning in Model-Free Reinforcement Learning

Abstract

We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban, a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by Guez et al. (2019), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves: (1) probing for planning-relevant concepts, (2) investigating plan formation within the agent's representations, and (3) verifying through interventions that the discovered plans (in the agent's representations) have a causal effect on the agent's behavior. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and discover a strong resemblance to parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, which is important given the recent trend of emergent planning and reasoning capabilities in large language models (LLMs) through reinforcement learning.
