ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

April 26, 2026
Authors: Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh
cs.AI

Abstract

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn, multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches a weighted score of 75.8, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
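To illustrate how rule-based scoring over post-execution state can produce a high weighted score alongside a failed strict check, here is a minimal sketch. It is not the benchmark's actual harness: the `Checker` class, the `score_task` function, the per-checker weights, the state layout, and every checker name are hypothetical and chosen only for illustration, since the abstract does not specify the checker interface or the weighting scheme.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of sandbox service state after the agent finishes.
ServiceState = Dict[str, dict]


@dataclass
class Checker:
    name: str
    weight: float  # assumed per-checker weight; not specified in the abstract
    check: Callable[[ServiceState], bool]


def score_task(state: ServiceState, checkers: List[Checker]) -> dict:
    """Run every checker and report a weighted score plus strict success."""
    results = {c.name: bool(c.check(state)) for c in checkers}
    total = sum(c.weight for c in checkers)
    earned = sum(c.weight for c in checkers if results[c.name])
    return {
        "results": results,
        "weighted_score": 100.0 * earned / total if total else 0.0,
        # Strict success requires every checker to pass.
        "strict_success": all(results.values()),
    }


# Illustrative state: the meeting was moved and the report exists,
# but no confirmation email was sent.
state = {
    "calendar": {"events": [{"title": "Vendor review", "day": "2026-04-28"}]},
    "email": {"sent": []},
    "filesystem": {"files": {"report.md": "Q1 summary"}},
}

checkers = [
    Checker(
        "meeting_rescheduled",
        2.0,
        lambda s: any(
            e["title"] == "Vendor review" and e["day"] == "2026-04-28"
            for e in s["calendar"]["events"]
        ),
    ),
    Checker("report_written", 1.0, lambda s: "report.md" in s["filesystem"]["files"]),
    Checker("confirmation_sent", 1.0, lambda s: len(s["email"]["sent"]) > 0),
]

summary = score_task(state, checkers)
print(summary["weighted_score"], summary["strict_success"])
```

In this toy case two of three checkers pass, so the weighted score is 75.0 while strict success is false, mirroring the kind of gap between partial progress and end-to-end completion that the abstract reports.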