Exploitation Is All You Need... for Exploration
August 2, 2025
Authors: Micah Rentschler, Jesse Roberts
cs.AI
Abstract
Ensuring sufficient exploration is a central challenge when training
meta-reinforcement learning (meta-RL) agents to solve novel environments.
Conventional solutions to the exploration-exploitation dilemma inject explicit
incentives such as randomization, uncertainty bonuses, or intrinsic rewards to
encourage exploration. In this work, we hypothesize that an agent trained
solely to maximize a greedy (exploitation-only) objective can nonetheless
exhibit emergent exploratory behavior, provided three conditions are met: (1)
Recurring Environmental Structure, where the environment features repeatable
regularities that allow past experience to inform future choices; (2) Agent
Memory, enabling the agent to retain and utilize historical interaction data;
and (3) Long-Horizon Credit Assignment, where learning propagates returns over
a time frame sufficient for the delayed benefits of exploration to inform
current decisions. Through experiments in stochastic multi-armed bandits and
temporally extended gridworlds, we observe that, when both structure and memory
are present, a policy trained on a strictly greedy objective exhibits
information-seeking exploratory behavior. We further demonstrate, through
controlled ablations, that emergent exploration vanishes if either
environmental structure or agent memory is absent (Conditions 1 & 2).
Surprisingly, removing long-horizon credit assignment (Condition 3) does not
always prevent emergent exploration, a result we attribute to the
pseudo-Thompson Sampling effect. These findings suggest that, under the right
prerequisites, exploration and exploitation need not be treated as orthogonal
objectives but can emerge from a unified reward-maximization process.
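As a concrete illustration of the setup the abstract describes, the following is a minimal sketch, not the authors' code: a recurrent policy with within-episode memory, trained by plain REINFORCE on a purely greedy, reward-maximizing objective in a stochastic two-armed bandit whose arm probabilities are resampled each episode (the recurring structure). The class names (BanditEnv, GreedyMetaAgent), the architecture, and all hyperparameters are illustrative assumptions rather than details from the paper.

# Minimal sketch (assumed setup, not the authors' implementation).
import torch
import torch.nn as nn
import numpy as np

class BanditEnv:
    """Two-armed Bernoulli bandit; arm probabilities are resampled every episode,
    giving the recurring environmental structure (Condition 1)."""
    def __init__(self, horizon=20):
        self.horizon = horizon

    def reset(self):
        self.p = np.random.uniform(0.0, 1.0, size=2)  # latent arm reward probabilities
        self.t = 0
        return np.zeros(3, dtype=np.float32)          # [prev_action_onehot(2), prev_reward]

    def step(self, action):
        reward = float(np.random.rand() < self.p[action])
        self.t += 1
        obs = np.zeros(3, dtype=np.float32)
        obs[action] = 1.0
        obs[2] = reward
        return obs, reward, self.t >= self.horizon

class GreedyMetaAgent(nn.Module):
    """LSTM policy: the recurrent state over past actions/rewards provides
    agent memory (Condition 2)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(3, hidden, batch_first=True)
        self.policy = nn.Linear(hidden, 2)

    def forward(self, obs, state):
        out, state = self.lstm(obs.view(1, 1, -1), state)
        return torch.distributions.Categorical(logits=self.policy(out[0, 0])), state

env, agent = BanditEnv(), GreedyMetaAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)

for episode in range(2000):
    obs, state, done = env.reset(), None, False
    log_probs, rewards = [], []
    while not done:
        dist, state = agent(torch.tensor(obs), state)
        action = dist.sample()
        obs, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # Strictly greedy objective: REINFORCE on the undiscounted return-to-go
    # (long-horizon credit assignment, Condition 3), with no entropy bonus,
    # uncertainty bonus, or intrinsic reward.
    returns = torch.tensor(np.cumsum(rewards[::-1])[::-1].copy(), dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Under this reading of the abstract, any exploratory behavior the trained policy shows (e.g., sampling both arms early in an episode before committing) would be emergent, since nothing in the loss rewards exploration directly.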