The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in Workplace Scenarios
January 13, 2026
Authors: Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi
cs.AI
Abstract
The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking the robustness required for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents exhibit significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.
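The three evaluation dimensions in the abstract can be illustrated with a small, self-contained sketch. The snippet below is not the actual EvoEnv API; all class and function names (ToyWorkplaceEnv, TraineeAgent, explore, strategy_memory) are hypothetical and chosen only to make the ideas concrete: a priority queue of streamed tasks, an explicit explore action the agent takes instead of guessing, and a strategy memory that carries distilled answers across rule-generated tasks.

```python
"""Minimal sketch of the three evaluation axes described in the abstract.
Not the EvoEnv interface; every name here is an illustrative assumption."""
import heapq
import random
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                             # lower value = more urgent
    arrival: int = field(compare=False)       # step at which the task streamed in
    name: str = field(compare=False)
    hidden_fact: str = field(compare=False)   # must be discovered by exploring


class TraineeAgent:
    """A toy agent that explores before answering and keeps distilled notes."""

    def __init__(self):
        self.strategy_memory: dict[str, str] = {}   # lessons carried across tasks

    def act(self, task: Task, env: "ToyWorkplaceEnv") -> str:
        # Prudent information acquisition: look the fact up instead of guessing,
        # unless a previously distilled strategy already covers this task type.
        if task.name in self.strategy_memory:
            return self.strategy_memory[task.name]
        answer = env.explore(task)                  # costs a step, avoids hallucination
        self.strategy_memory[task.name] = answer    # continual learning
        return answer


class ToyWorkplaceEnv:
    """Streams rule-generated tasks and scores the agent on the three axes."""

    def __init__(self, n_tasks: int = 5, seed: int = 0):
        rng = random.Random(seed)
        self.queue: list[Task] = []
        for t in range(n_tasks):
            task = Task(priority=rng.randint(0, 2), arrival=t,
                        name=f"task_{t % 3}", hidden_fact=f"fact_{t % 3}")
            heapq.heappush(self.queue, task)        # tasks arrive with priorities
        self.explorations = 0

    def explore(self, task: Task) -> str:
        self.explorations += 1
        return task.hidden_fact

    def run(self, agent: TraineeAgent) -> dict:
        correct, total = 0, len(self.queue)
        while self.queue:
            task = heapq.heappop(self.queue)        # context-aware scheduling: most urgent first
            if agent.act(task, self) == task.hidden_fact:
                correct += 1
        return {"accuracy": correct / total,
                "explorations": self.explorations,
                "strategies_learned": len(agent.strategy_memory)}


if __name__ == "__main__":
    report = ToyWorkplaceEnv().run(TraineeAgent())
    print(report)   # e.g. {'accuracy': 1.0, 'explorations': 3, 'strategies_learned': 3}
```

In this toy setup, reporting accuracy together with the number of explorations and learned strategies mirrors how a dynamic benchmark can separate correctness from exploration efficiency and knowledge reuse, rather than measuring a single static score.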