Interpreting Emergent Planning in Model-Free Reinforcement Learning

Abstract

We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban, a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced by Guez et al. (2019), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves: (1) probing for planning-relevant concepts, (2) investigating plan formation within the agent's representations, and (3) verifying that discovered plans (in the agent's representations) have a causal effect on the agent's behavior through interventions. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and discover a strong resemblance to parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, which is important given the recent trend of emergent planning and reasoning capabilities in LLMs through RL.
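
Step (1) of the methodology, probing for planning-relevant concepts, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a linear softmax probe trained by gradient descent, and it substitutes synthetic vectors for the agent's internal activations. All function names, dimensions, and hyperparameters below are hypothetical.

```python
import numpy as np


def make_synthetic_activations(n=600, d=32, num_classes=3, seed=0):
    """Stand-in for agent activations (assumption: not real DRC data).

    Each concept class gets a random direction in activation space, and each
    sample is that direction plus Gaussian noise, so the concept is linearly
    decodable by construction.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, num_classes, size=n)
    directions = rng.normal(size=(num_classes, d))
    acts = directions[labels] + 0.5 * rng.normal(size=(n, d))
    return acts, labels


def train_linear_probe(acts, labels, num_classes, lr=0.1, steps=200):
    """Fit a softmax-regression probe from activations to concept labels."""
    n, d = acts.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = acts @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (acts.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def probe_accuracy(W, b, acts, labels):
    """Fraction of samples whose concept label the probe predicts correctly."""
    preds = (acts @ W + b).argmax(axis=1)
    return float((preds == labels).mean())


def run_probe_experiment():
    """Train on 450 samples and report accuracy on the remaining 150."""
    acts, labels = make_synthetic_activations()
    W, b = train_linear_probe(acts[:450], labels[:450], num_classes=3)
    return probe_accuracy(W, b, acts[450:], labels[450:])


if __name__ == "__main__":
    print(f"held-out probe accuracy: {run_probe_experiment():.3f}")
```

High held-out accuracy from such a probe suggests that a concept is linearly decodable from the activations. It does not by itself show that the agent uses the concept, which is why the methodology pairs probing with interventions in step (3).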
