Meta-RL Induces Exploration in Language Agents
December 18, 2025
Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
cs.AI
Abstract
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents that interact with an environment and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle on tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and optimizes long-term rewards; and (ii) in-context policy adaptation via reflection, which allows the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer demonstrates better generalization to more challenging or previously unseen tasks than RL-trained agents. Overall, our results show that Meta-RL provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
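To make the second component concrete, below is a minimal sketch (not the authors' code) of what gradient-free, cross-episode adaptation via reflection could look like at test time: after each episode, a reflection on the observed feedback is appended to the agent's context, so the next episode's policy changes without any parameter update. The environment interface (reset/step), the llm() helper, and all prompts are hypothetical placeholders, not from the paper.

```python
# Hypothetical sketch of reflection-based in-context adaptation across episodes.
# All names (llm, ToyEnv, prompts) are illustrative assumptions.

def llm(prompt: str) -> str:
    """Stand-in for an LLM call; a real agent would query a language model here."""
    return "move right"  # placeholder action / reflection text


class ToyEnv:
    """Tiny stand-in environment with a gym-like interface (assumed, not from the paper)."""
    def reset(self):
        return "initial observation"

    def step(self, action: str):
        # Returns (observation, reward, done).
        return "next observation", 0.0, True


def run_episode(env, context: str, max_steps: int = 10):
    """Roll out one episode, conditioning every action on the accumulated context."""
    obs, trajectory, total_reward = env.reset(), [], 0.0
    for _ in range(max_steps):
        action = llm(f"{context}\nObservation: {obs}\nAction:")
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        total_reward += reward
        if done:
            break
    return trajectory, total_reward


def adapt_in_context(env, task_prompt: str, num_episodes: int = 3):
    """Cross-episode loop: a reflection on each episode's feedback is appended to
    the context, so the next episode's behavior adapts without gradient updates."""
    context = task_prompt
    for episode in range(num_episodes):
        trajectory, total_reward = run_episode(env, context)
        reflection = llm(
            f"{context}\nEpisode {episode} trajectory: {trajectory}\n"
            f"Return: {total_reward}\nReflect on what to try differently:"
        )
        context += f"\nReflection {episode}: {reflection}"
    return context


if __name__ == "__main__":
    adapt_in_context(ToyEnv(), "Task: solve the puzzle.")
```

The sketch only illustrates the adaptation loop at inference; the cross-episode RL training that rewards effective exploration (component (i)) would happen beforehand and is not shown here.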