EEVEE: Test-Time Prompt Learning for Self-Improving Agents in the Real World

Abstract

We propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy that interleaves router learning and prompt learning phases to address their mutual dependency. Experiments across multiple datasets show that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, and surpasses the state-of-the-art methods GEPA and ACE by up to 37.2% and 48.2%, respectively.
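The interleaved router and prompt learning phases can be illustrated with a toy sketch. This is not EEVEE's implementation: the keyword-overlap Router, the score function, the feedback strings, and the name co_evolve are all hypothetical stand-ins chosen so the example runs without an LLM. It only shows the alternating structure, in which prompts are updated with routing held fixed and routing is then updated with prompts held fixed.

```python
from collections import Counter

# Toy stream mixing two tasks. The second field is a hypothetical stand-in for
# the feedback an agent would extract from an LLM rollout (an assumption).
STREAM = [
    ("compute 3 plus 4", "show arithmetic steps"),
    ("compute 12 times 5", "show arithmetic steps"),
    ("according to the passage who wrote it", "quote the passage"),
    ("compute 9 minus 2", "show arithmetic steps"),
    ("according to the passage where is it", "quote the passage"),
    ("according to the passage when did it end", "quote the passage"),
]


def score(prompt, feedback):
    """Placeholder reward: 1 if the prompt carries the instruction the input needs."""
    return 1 if prompt == feedback else 0


class Router:
    """Hypothetical router: picks the cluster whose keyword set overlaps most."""

    def __init__(self, n_clusters):
        self.keywords = [set() for _ in range(n_clusters)]

    def fit(self, texts, assignment):
        # Rebuild each cluster's keyword set from the inputs assigned to it.
        self.keywords = [set() for _ in self.keywords]
        for text, cluster in zip(texts, assignment):
            self.keywords[cluster] |= set(text.split())

    def route(self, text):
        overlap = [len(set(text.split()) & kw) for kw in self.keywords]
        return overlap.index(max(overlap))


def co_evolve(stream, n_clusters=2, rounds=3):
    texts = [text for text, _ in stream]
    assignment = [i % n_clusters for i in range(len(stream))]  # arbitrary start
    prompts = [""] * n_clusters
    router = Router(n_clusters)
    for _ in range(rounds):
        # Prompt phase: routing is fixed; each cluster adopts the instruction
        # most often needed by the inputs currently assigned to it.
        for c in range(n_clusters):
            seen = Counter(fb for (_, fb), a in zip(stream, assignment) if a == c)
            if seen:
                prompts[c] = seen.most_common(1)[0][0]
        # Router phase: prompts are fixed; each input moves to the cluster
        # whose prompt scores best on it, then the router is refit.
        assignment = [
            max(range(n_clusters), key=lambda c: score(prompts[c], fb))
            for _, fb in stream
        ]
        router.fit(texts, assignment)
    return router, prompts


if __name__ == "__main__":
    router, prompts = co_evolve(STREAM)
    query = "compute 7 plus 8"
    print(query, "->", prompts[router.route(query)])
```

In this sketch the initial assignment mixes both tasks in each cluster, so the first prompt phase only captures each cluster's majority need. The router phase that follows moves the minority inputs to the cluster whose prompt serves them, which is the mutual dependency the alternating schedule is meant to resolve.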