Meta-RL Induces Exploration in Language Agents
December 18, 2025
Authors: Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic
cs.AI
Abstract
Reinforcement learning (RL) has enabled training large language model (LLM) agents to interact with environments and solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training scheme that encourages exploration and optimizes long-term reward; and (ii) in-context policy adaptation via reflection, which allows the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly outperforms RL baselines, with gains of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer demonstrates better generalization to more challenging or previously unseen tasks than RL-trained agents. Overall, our results demonstrate that Meta-RL provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
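To make the two components concrete, the following is a minimal sketch (not the paper's implementation) of a cross-episode rollout with reflection-based in-context adaptation: the agent attempts the same task over several episodes, the summed reward across episodes is what a Meta-RL objective would optimize, and the policy adapts only by appending reflections to the prompt. The helpers `run_episode` and `reflect` are hypothetical stand-ins for LLM-driven rollout and reflection calls.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   run_episode(context) -> (trajectory_text, reward): one attempt at the task.
#   reflect(trajectory_text, reward) -> reflection_text: verbal feedback on the attempt.
def cross_episode_rollout(
    run_episode: Callable[[str], Tuple[str, float]],
    reflect: Callable[[str, float], str],
    task_prompt: str,
    num_episodes: int = 3,
) -> float:
    """Roll out several episodes on one task, adapting purely in-context.

    No gradient updates happen at test time; the agent's "policy update" is the
    growing context of past trajectories and reflections. Training would reward
    the cross-episode return, so early exploratory episodes that pay off later
    are encouraged.
    """
    context = task_prompt
    total_reward = 0.0
    for episode in range(num_episodes):
        trajectory, reward = run_episode(context)       # one attempt with current context
        total_reward += reward                          # cross-episode (long-term) return
        reflection = reflect(trajectory, reward)        # what to try differently next time
        context += f"\n[Episode {episode}] reward={reward}\n{reflection}"
    return total_reward
```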