EEVEE: Toward Test-Time Prompt Learning for Self-Improving Agents in the Real World

Abstract

We propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, which enables test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications must handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. Because the router and the prompts depend on each other, the design is optimized with a router-prompt co-evolution strategy that interleaves router learning phases with prompt learning phases. Experiments across multiple datasets show that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves the average multi-benchmark score by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, and surpasses the SOTA methods GEPA and ACE by up to 37.2% and 48.2%, respectively.
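
The alternation between a router phase and a prompt phase can be illustrated with a toy sketch. This is not the EEVEE implementation: the bag-of-words router, the fixed pool of candidate prompts, the `evaluate` callback, and every name below (`Router`, `co_evolve`, `best_cluster`) are hypothetical stand-ins chosen only to make the control flow concrete. A real system would presumably score prompts with LLM feedback and learn the router differently.

```python
from collections import Counter


def featurize(text):
    """Bag-of-words features; an assumed stand-in for a real router's representation."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Unnormalized token overlap between two bags of words."""
    return sum(min(count, b[token]) for token, count in a.items())


class Router:
    """Toy router: sends each input to the cluster with the most similar centroid."""

    def __init__(self, seeds):
        # One seed text per cluster initializes the centroids.
        self.centroids = [featurize(s) for s in seeds]

    def route(self, text):
        feats = featurize(text)
        scores = [similarity(feats, c) for c in self.centroids]
        return max(range(len(scores)), key=scores.__getitem__)

    def refit(self, texts, labels):
        # Rebuild each centroid from the texts currently labeled with that cluster.
        for k in range(len(self.centroids)):
            members = [featurize(t) for t, l in zip(texts, labels) if l == k]
            if members:
                self.centroids[k] = sum(members, Counter())


def best_cluster(text, current, prompts, evaluate):
    """Move an input to another cluster only if that cluster's prompt scores strictly higher."""
    best = current
    for k, prompt in enumerate(prompts):
        if evaluate(prompt, text) > evaluate(prompts[best], text):
            best = k
    return best


def co_evolve(stream, router, candidates, evaluate, rounds=2):
    """Alternate a prompt phase (router fixed) with a router phase (prompts fixed)."""
    prompts = [candidates[0]] * len(router.centroids)
    for _ in range(rounds):
        # Prompt phase: pick the best candidate prompt for each cluster's inputs.
        labels = [router.route(x) for x in stream]
        for k in range(len(prompts)):
            items = [x for x, l in zip(stream, labels) if l == k]
            if items:
                prompts[k] = max(
                    candidates, key=lambda p: sum(evaluate(p, x) for x in items)
                )
        # Router phase: relabel inputs using the current prompts, then refit centroids.
        labels = [best_cluster(x, l, prompts, evaluate) for x, l in zip(stream, labels)]
        router.refit(stream, labels)
    return router, prompts
```

In this sketch the mutual dependency shows up directly: the prompt phase needs the router's assignments to know which inputs each prompt must serve, and the router phase needs the current prompts to decide where each input is best handled.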