

Exploitation Is All You Need... for Exploration

August 2, 2025
Authors: Micah Rentschler, Jesse Roberts
cs.AI

Abstract

Ensuring sufficient exploration is a central challenge when training meta-reinforcement learning (meta-RL) agents to solve novel environments. Conventional solutions to the exploration-exploitation dilemma inject explicit incentives such as randomization, uncertainty bonuses, or intrinsic rewards to encourage exploration. In this work, we hypothesize that an agent trained solely to maximize a greedy (exploitation-only) objective can nonetheless exhibit emergent exploratory behavior, provided three conditions are met: (1) Recurring Environmental Structure, where the environment features repeatable regularities that allow past experience to inform future choices; (2) Agent Memory, enabling the agent to retain and utilize historical interaction data; and (3) Long-Horizon Credit Assignment, where learning propagates returns over a time frame sufficient for the delayed benefits of exploration to inform current decisions. Through experiments in stochastic multi-armed bandits and temporally extended gridworlds, we observe that, when both structure and memory are present, a policy trained on a strictly greedy objective exhibits information-seeking exploratory behavior. We further demonstrate, through controlled ablations, that emergent exploration vanishes if either environmental structure or agent memory is absent (Conditions 1 & 2). Surprisingly, removing long-horizon credit assignment (Condition 3) does not always prevent emergent exploration, a result we attribute to the pseudo-Thompson Sampling effect. These findings suggest that, under the right prerequisites, exploration and exploitation need not be treated as orthogonal objectives but can emerge from a unified reward-maximization process.
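
To make the setting concrete, the sketch below illustrates the first two conditions in a stochastic multi-armed bandit: recurring structure (arm means redrawn each episode from a shared prior) and within-episode memory (per-arm statistics accumulated from the agent's own history), combined with strictly greedy action selection. This is not the authors' meta-trained agent; the learned, memory-conditioned policy is replaced here by hand-coded empirical-mean estimates with an optimistic placeholder for untried arms, purely to show how a greedy readout of remembered history can still visit every arm. All class names, hyperparameters, and design choices below are illustrative assumptions.

```python
# Minimal sketch (NOT the paper's implementation): a K-armed Bernoulli bandit
# whose arm means are resampled each episode (Condition 1: recurring
# structure), paired with an agent that acts greedily on statistics computed
# from its own within-episode history (Condition 2: agent memory).
import numpy as np

rng = np.random.default_rng(0)


class BanditEpisode:
    """One episode of a K-armed Bernoulli bandit; arm means are drawn fresh
    per episode from a fixed prior shared across episodes."""

    def __init__(self, k=10):
        self.means = rng.uniform(0.0, 1.0, size=k)  # episode-specific task

    def pull(self, arm):
        return float(rng.random() < self.means[arm])  # Bernoulli reward


class GreedyMemoryAgent:
    """Hypothetical stand-in for a memory-conditioned policy: it keeps per-arm
    counts and reward sums from the current episode and always picks the arm
    with the highest estimated value. Untried arms receive an optimistic
    placeholder estimate, so strictly greedy selection still samples every
    arm at least once before committing."""

    def __init__(self, k=10, optimistic_value=1.0):
        self.counts = np.zeros(k)
        self.sums = np.zeros(k)
        self.optimistic_value = optimistic_value

    def act(self):
        estimates = np.where(
            self.counts > 0,
            self.sums / np.maximum(self.counts, 1),  # empirical mean per arm
            self.optimistic_value,                   # untried arms look best
        )
        return int(np.argmax(estimates))  # strictly greedy choice

    def observe(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def run_episode(horizon=200, k=10):
    env, agent = BanditEpisode(k), GreedyMemoryAgent(k)
    total = 0.0
    for _ in range(horizon):
        arm = agent.act()
        reward = env.pull(arm)
        agent.observe(arm, reward)
        total += reward
    return total, env.means.max() * horizon  # achieved vs. oracle return


if __name__ == "__main__":
    earned, oracle = run_episode()
    print(f"greedy-with-memory return: {earned:.0f} / oracle {oracle:.0f}")
```

In this toy version the information-seeking behavior is injected by the optimistic initialization rather than learned; in the paper's setup, by contrast, the analogous behavior is reported to emerge from reward-maximization training once structure and memory are available.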