Exploitation Is All You Need... for Exploration
August 2, 2025
Authors: Micah Rentschler, Jesse Roberts
cs.AI
Abstract
Ensuring sufficient exploration is a central challenge when training
meta-reinforcement learning (meta-RL) agents to solve novel environments.
Conventional solutions to the exploration-exploitation dilemma inject explicit
incentives such as randomization, uncertainty bonuses, or intrinsic rewards to
encourage exploration. In this work, we hypothesize that an agent trained
solely to maximize a greedy (exploitation-only) objective can nonetheless
exhibit emergent exploratory behavior, provided three conditions are met: (1)
Recurring Environmental Structure, where the environment features repeatable
regularities that allow past experience to inform future choices; (2) Agent
Memory, enabling the agent to retain and utilize historical interaction data;
and (3) Long-Horizon Credit Assignment, where learning propagates returns over
a time frame sufficient for the delayed benefits of exploration to inform
current decisions. Through experiments in stochastic multi-armed bandits and
temporally extended gridworlds, we observe that, when both structure and memory
are present, a policy trained on a strictly greedy objective exhibits
information-seeking exploratory behavior. We further demonstrate, through
controlled ablations, that emergent exploration vanishes if either
environmental structure or agent memory is absent (Conditions 1 & 2).
Surprisingly, removing long-horizon credit assignment (Condition 3) does not
always prevent emergent exploration, a result we attribute to the
pseudo-Thompson Sampling effect. These findings suggest that, under the right
prerequisites, exploration and exploitation need not be treated as orthogonal
objectives but can emerge from a unified reward-maximization process.
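As a concrete illustration of the setup the abstract describes, the following is a minimal sketch, not the authors' code: a recurrent policy with within-episode memory, trained by plain REINFORCE on a purely greedy, reward-maximizing objective in a stochastic two-armed bandit whose arm probabilities are resampled each episode (the recurring structure). The class names (BanditEnv, GreedyMetaAgent), the architecture, and all hyperparameters are illustrative assumptions rather than details from the paper.

# Minimal sketch (assumed setup, not the authors' implementation).
import torch
import torch.nn as nn
import numpy as np

class BanditEnv:
    """Two-armed Bernoulli bandit; arm probabilities are resampled every episode,
    giving the recurring environmental structure (Condition 1)."""
    def __init__(self, horizon=20):
        self.horizon = horizon

    def reset(self):
        self.p = np.random.uniform(0.0, 1.0, size=2)  # latent arm reward probabilities
        self.t = 0
        return np.zeros(3, dtype=np.float32)          # [prev_action_onehot(2), prev_reward]

    def step(self, action):
        reward = float(np.random.rand() < self.p[action])
        self.t += 1
        obs = np.zeros(3, dtype=np.float32)
        obs[action] = 1.0
        obs[2] = reward
        return obs, reward, self.t >= self.horizon

class GreedyMetaAgent(nn.Module):
    """LSTM policy: the recurrent state over past actions/rewards provides
    agent memory (Condition 2)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(3, hidden, batch_first=True)
        self.policy = nn.Linear(hidden, 2)

    def forward(self, obs, state):
        out, state = self.lstm(obs.view(1, 1, -1), state)
        return torch.distributions.Categorical(logits=self.policy(out[0, 0])), state

env, agent = BanditEnv(), GreedyMetaAgent()
opt = torch.optim.Adam(agent.parameters(), lr=1e-3)

for episode in range(2000):
    obs, state, done = env.reset(), None, False
    log_probs, rewards = [], []
    while not done:
        dist, state = agent(torch.tensor(obs), state)
        action = dist.sample()
        obs, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    # Strictly greedy objective: REINFORCE on the undiscounted return-to-go
    # (long-horizon credit assignment, Condition 3), with no entropy bonus,
    # uncertainty bonus, or intrinsic reward.
    returns = torch.tensor(np.cumsum(rewards[::-1])[::-1].copy(), dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

Under this reading of the abstract, any exploratory behavior the trained policy shows (e.g., sampling both arms early in an episode before committing) would be emergent, since nothing in the loss rewards exploration directly.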