

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in Workplace Scenarios

January 13, 2026
Authors: Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi
cs.AI

Abstract

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking the robustness required for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce EvoEnv, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, EvoEnv evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition via active exploration to reduce hallucination; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents exhibit significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our code is available at https://github.com/KnowledgeXLab/EvoEnv.
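To make dimension (1) concrete, below is a minimal, illustrative sketch of a priority-aware scheduling loop over streaming tasks. It is not taken from the EvoEnv codebase: the `Task` class, the arrival schedule, the greedy baseline policy, and the priority-weighted score are hypothetical stand-ins for the kind of context-aware scheduling the abstract describes.

```python
# Hypothetical sketch (not the EvoEnv API): tasks stream in over discrete steps,
# and a greedy baseline always works on the highest-priority pending task.
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int   # 1 = routine ... 3 = urgent
    deadline: int   # last step at which completion still counts
    work: int       # steps of effort required


def run_episode(arrivals, horizon=20):
    """Greedy baseline: always work on the highest-priority (then earliest-deadline)
    pending task; return the priority-weighted count of on-time completions."""
    queue = []            # heap of (-priority, deadline, tiebreak, Task)
    score, counter = 0, 0
    for step in range(horizon):
        for task in arrivals.get(step, []):       # new tasks stream in mid-episode
            heapq.heappush(queue, (-task.priority, task.deadline, counter, task))
            counter += 1
        if queue:
            task = queue[0][3]
            task.work -= 1                        # spend one step of effort on it
            if task.work == 0:
                heapq.heappop(queue)
                if step <= task.deadline:         # late completions earn nothing
                    score += task.priority
    return score


if __name__ == "__main__":
    arrivals = {
        0: [Task("draft weekly report", priority=1, deadline=15, work=4)],
        2: [Task("reply to manager", priority=3, deadline=5, work=2)],
        3: [Task("book a meeting room", priority=2, deadline=8, work=1)],
    }
    print("priority-weighted score:", run_episode(arrivals))
```

A non-preemptive or deadline-unaware policy scores lower under the same arrival schedule, which is the kind of gap a streaming-task scheduling benchmark is designed to expose.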