The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments and overlooks the robustness needed for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continual learning from experience. To bridge this gap, we introduce a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, it evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.
