
Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies

April 2, 2026
作者: Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi
cs.AI

Abstract

Test-Time Learning (TTL) enables language agents to iteratively refine their performance through repeated interactions with the environment at inference time. At the core of TTL is an adaptation policy that updates the actor policy based on experience from previous episodes, thereby improving future behavior. Existing methods rely on fixed, hand-crafted adaptation policies rather than optimizing them for downstream improvement. We argue that optimal adaptation policies should be learned from task environments, not hand-engineered based on human intuition. To achieve this, we introduce Meta-TTL, a framework that formulates the discovery of effective adaptation policies as a bi-level optimization problem. Within this framework, the inner loop executes the standard TTL process, measuring how effectively a candidate adaptation policy helps an agent correct errors across sequential episodes. Guided by the agent's performance, the outer loop employs evolutionary search over a diverse distribution of training tasks to iteratively refine the adaptation policy. We evaluate Meta-TTL on Jericho and WebArena-Lite across both in-distribution (ID) and out-of-distribution (OOD) settings, using multiple meta-agent backbones. Results on both benchmarks show that Meta-TTL consistently outperforms hand-crafted baselines, suggesting that the optimized adaptation policy encodes transferable strategies that generalize beyond the training task distribution.
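The bi-level structure described above can be illustrated with a deliberately simplified sketch. This is not the paper's implementation: the real inner loop runs a language agent in Jericho/WebArena-Lite episodes, and the adaptation policy is far richer than a scalar. Here, as a labeled assumption, the "task" is estimating a hidden target value, the adaptation policy is a single update-rate scalar, and the outer loop is a minimal mutation-based evolutionary search scored by average inner-loop (TTL) performance across training tasks.

```python
import random

# Toy stand-in for an episode: reward is higher the closer the actor's
# current estimate is to the hidden task target (0 is the best possible).
def run_episode(estimate, target):
    return -abs(estimate - target)

# Inner loop (standard TTL): the candidate adaptation policy -- here just a
# scalar rate -- updates the actor from the error observed each episode.
def inner_ttl(adapt_rate, target, episodes=10):
    estimate, total = 0.0, 0.0
    for _ in range(episodes):
        total += run_episode(estimate, target)
        estimate += adapt_rate * (target - estimate)  # adaptation step
    return total / episodes  # average per-episode reward

# Outer loop: evolutionary search over adaptation policies, each candidate
# scored by its average inner-loop performance over the training tasks.
def meta_ttl(tasks, generations=30, pop=8, seed=0):
    rng = random.Random(seed)
    population = [rng.uniform(0.0, 1.0) for _ in range(pop)]

    def fitness(rate):
        return sum(inner_ttl(rate, t) for t in tasks) / len(tasks)

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]                  # keep the best half
        children = [min(1.0, max(0.0, p + rng.gauss(0, 0.1)))
                    for p in parents]                     # mutate survivors
        population = parents + children
    return max(population, key=fitness)

train_tasks = [2.0, 5.0, -3.0]
best = meta_ttl(train_tasks)
```

In this toy setting the evolved policy (a rate near 1) also helps on an unseen "OOD" target, loosely mirroring the paper's claim that the optimized adaptation policy transfers beyond the training task distribution.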