EEVEE: Test-Time Prompt Learning for Self-Improving Agents in Real-World Environments

Abstract

We propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, which enables test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications must handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. The router and prompts are optimized with a router-prompt co-evolution strategy, which interleaves router learning and prompt learning phases to address their mutual dependency. Experiments across multiple datasets show that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, and surpasses the state-of-the-art methods GEPA and ACE by up to 37.2% and 48.2%, respectively.
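
To make the routing and interleaved-learning idea concrete, the following is a minimal toy sketch, not the EEVEE implementation. It assumes a bag-of-words nearest-centroid router and a placeholder prompt update; the names `Router`, `update_prompt`, and `co_evolve` are hypothetical, and a real system would presumably use learned embeddings and an LLM-based prompt optimizer instead.

```python
from collections import Counter, defaultdict


def featurize(text):
    """Bag-of-words features; a stand-in for a real router's embedding (assumption)."""
    return Counter(text.lower().split())


def similarity(a, b):
    """Unnormalized word overlap between two bags of words."""
    return sum(min(a[w], b[w]) for w in a)


class Router:
    """Hypothetical router: sends each input to the most similar cluster centroid."""

    def __init__(self, seeds):
        # seeds maps a cluster id to a short seed text describing that cluster.
        self.centroids = {cid: featurize(text) for cid, text in seeds.items()}

    def route(self, text):
        feats = featurize(text)
        # Sorting the ids makes tie-breaking deterministic.
        return max(sorted(self.centroids),
                   key=lambda cid: similarity(feats, self.centroids[cid]))

    def update(self, assignments):
        # Router phase: rebuild each centroid from the inputs assigned to it.
        for cid, texts in assignments.items():
            if texts:
                merged = Counter()
                for t in texts:
                    merged.update(featurize(t))
                self.centroids[cid] = merged


def update_prompt(prompt, texts):
    """Prompt phase placeholder; a real system would call a prompt optimizer here."""
    return f"{prompt} [refined on {len(texts)} examples]"


def co_evolve(router, prompts, stream, rounds=1):
    """Alternate prompt updates (router fixed) and router updates (prompts fixed)."""
    for _ in range(rounds):
        assignments = defaultdict(list)
        for text in stream:
            assignments[router.route(text)].append(text)
        for cid, texts in assignments.items():
            prompts[cid] = update_prompt(prompts[cid], texts)
        router.update(assignments)
    return router, prompts
```

In this sketch each round first routes the whole stream, then refines only the prompts of clusters that received inputs, and finally re-estimates the router from the same assignments, which is one simple way to interleave the two phases.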