Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
May 5, 2026
Authors: Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Weizheng Wang, Hongzhang Huang, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu
cs.AI
Abstract
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on workspace learning involving large-scale file dependencies. We construct realistic workspaces spanning 5 worker profiles, 74 file types, and 20,476 files (up to 20 GB), and curate 388 tasks, each with its own file dependency graph, evaluated against 7,399 rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark's distribution while reducing evaluation cost by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning: the best model reaches only 68.7% accuracy, substantially below the human baseline of 80.7%, and the average accuracy across agents is only 47.4%.
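To make the abstract's structure concrete, the following is a minimal, illustrative sketch of what a benchmark task carrying a file dependency graph and rubric-based scoring might look like. All names here (`WorkspaceTask`, `Rubric`, the field names, and the example file paths) are assumptions for illustration only and do not reflect the authors' released data format.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: class, field, and file names are assumptions,
# not the schema shipped with Workspace-Bench.

@dataclass
class Rubric:
    """One atomic evaluation criterion for a task."""
    description: str      # e.g. "the summary cites the figure from budget_q3.xlsx"
    passed: bool = False  # filled in by a grader after the agent's attempt

@dataclass
class WorkspaceTask:
    """A single benchmark task defined over a worker's workspace."""
    task_id: str
    profile: str                            # worker persona the workspace belongs to
    instruction: str                        # natural-language task given to the agent
    dependency_graph: dict[str, list[str]]  # file path -> files it depends on
    rubrics: list[Rubric] = field(default_factory=list)

    def score(self) -> float:
        """Fraction of rubrics satisfied by the agent's output."""
        if not self.rubrics:
            return 0.0
        return sum(r.passed for r in self.rubrics) / len(self.rubrics)

# Tiny usage example with made-up file names.
task = WorkspaceTask(
    task_id="demo-001",
    profile="financial_analyst",
    instruction="Update the Q3 report so its totals match the latest budget sheet.",
    dependency_graph={
        "reports/q3_report.docx": ["sheets/budget_q3.xlsx", "notes/assumptions.md"],
        "sheets/budget_q3.xlsx": [],
    },
    rubrics=[
        Rubric("Report totals match budget_q3.xlsx", passed=True),
        Rubric("Assumptions section references notes/assumptions.md", passed=False),
    ],
)
print(f"Task score: {task.score():.2f}")  # -> 0.50
```

Under this reading, task-level accuracy is simply the fraction of a task's rubrics that pass, and the headline numbers (68.7% best-model, 47.4% average, 80.7% human) would be aggregates of such per-task scores; how the paper actually aggregates them is not specified in the abstract.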