Exploitation Is All You Need... for Exploration

Abstract

Ensuring sufficient exploration is a central challenge when training meta-reinforcement learning (meta-RL) agents to solve novel environments. Conventional solutions to the exploration-exploitation dilemma inject explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards to encourage exploration. In this work, we hypothesize that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior, provided three conditions are met: (1) Recurring Environmental Structure, where the environment features repeatable regularities that allow past experience to inform future choices; (2) Agent Memory, enabling the agent to retain and utilize historical interaction data; and (3) Long-Horizon Credit Assignment, where learning propagates returns over a time frame sufficient for the delayed benefits of exploration to inform current decisions. Through experiments in stochastic multi-armed bandits and temporally extended gridworlds, we observe that, when both structure and memory are present, a policy trained on a strictly greedy objective exhibits information-seeking exploratory behavior. We further demonstrate, through controlled ablations, that emergent exploration vanishes if either environmental structure or agent memory is absent (Conditions 1 and 2). Surprisingly, removing long-horizon credit assignment (Condition 3) does not always prevent emergent exploration, a result we attribute to the pseudo-Thompson Sampling effect. These findings suggest that, under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
