EEVEE: Toward Test-Time Prompt Learning for Self-Improving Agents in the Real World

Abstract

We propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, which enables test-time prompt learning on real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns each cluster to a suitable prompt configuration. The router and the prompts are optimized jointly through a router-prompt co-evolution strategy, which interleaves router learning phases with prompt learning phases to handle their mutual dependency. Experiments across multiple datasets show that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves the average multi-benchmark score by 10.38 points on Qwen3-4B-Instruct and by 24.32 points on DeepSeek-V3.2, and it surpasses the SOTA methods GEPA and ACE by up to 37.2% and 48.2%.
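The interleaving of router and prompt learning phases can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the nearest-centroid `Router`, the `co_evolve` loop, the fixed list of candidate prompts, and the `toy_score` function are hypothetical stand-ins. In particular, the router update here is a plain k-means step that ignores prompt feedback, which the actual method may well use.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class Router:
    """Toy nearest-centroid router (hypothetical, not EEVEE's router)."""

    def __init__(self, centroids: Sequence[Vector]) -> None:
        self.centroids: List[Vector] = [tuple(c) for c in centroids]

    def route(self, x: Vector) -> int:
        """Return the index of the task cluster closest to x."""
        return min(range(len(self.centroids)),
                   key=lambda k: sq_dist(x, self.centroids[k]))

    def partition(self, batch: Sequence[Vector]) -> Dict[int, List[Vector]]:
        """Group a batch of inputs by their assigned cluster."""
        groups: Dict[int, List[Vector]] = {k: [] for k in range(len(self.centroids))}
        for x in batch:
            groups[self.route(x)].append(x)
        return groups

    def update(self, batch: Sequence[Vector]) -> None:
        """One k-means style step: move each centroid to the mean of its members."""
        for k, members in self.partition(batch).items():
            if members:
                self.centroids[k] = tuple(sum(col) / len(members)
                                          for col in zip(*members))


def co_evolve(batch: Sequence[Vector],
              router: Router,
              candidates: Sequence[str],
              score: Callable[[str, Vector], float],
              rounds: int = 3) -> Dict[int, str]:
    """Alternate a router phase and a prompt phase for a fixed number of rounds."""
    prompts = {k: candidates[0] for k in range(len(router.centroids))}
    for _ in range(rounds):
        # Router phase: prompts are frozen, cluster assignments are refined.
        router.update(batch)
        # Prompt phase: the router is frozen, each cluster keeps the
        # candidate prompt with the highest total score on its members.
        for k, members in router.partition(batch).items():
            if members:
                prompts[k] = max(candidates,
                                 key=lambda p: sum(score(p, x) for x in members))
    return prompts


# Hypothetical mixed stream: two "math-like" and two "code-like" inputs.
batch = [(0.0, 0.0), (1.0, 0.5), (10.0, 10.0), (9.0, 11.0)]


def toy_score(prompt: str, x: Vector) -> float:
    """Made-up reward: 1.0 when the prompt matches the input's task type."""
    wanted = "math-prompt" if x[0] < 5 else "code-prompt"
    return 1.0 if prompt == wanted else 0.0


router = Router([(2.0, 2.0), (8.0, 8.0)])
prompts = co_evolve(batch, router, ["math-prompt", "code-prompt"], toy_score)
```

At inference time, a new input `x` would be served with `prompts[router.route(x)]`. Freezing one component while the other is updated is one simple way to cope with the mutual dependency the abstract mentions: prompt quality is only meaningful for a fixed partition, and the partition is only useful relative to the prompts attached to it.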