ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
April 6, 2026
Authors: Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee
cs.AI
Abstract
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky because agents can make irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta-prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10-percentage-point band on task success (53-63%), with unsafe action rates ranging from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.
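To make the snapshot/restore idea concrete, here is a minimal Python sketch of how a deterministic mock service might work. All names (MockGmail, send_email, the state schema) are hypothetical illustrations, not ClawsBench's actual API: the point is only that keeping all service state in one plain structure makes snapshotting a deep copy and makes every episode replay from an identical baseline.

```python
import copy


class MockGmail:
    """Hypothetical in-memory stand-in for a Gmail-like API surface.

    Names and fields are illustrative, not ClawsBench's actual schema.
    """

    def __init__(self):
        # All service state lives in one plain dict, so snapshotting
        # reduces to a deep copy and runs are fully deterministic.
        self._state = {"threads": {}, "next_id": 1}

    def send_email(self, to: str, subject: str, body: str) -> str:
        tid = str(self._state["next_id"])
        self._state["next_id"] += 1
        self._state["threads"][tid] = {"to": to, "subject": subject, "body": body}
        return tid

    def snapshot(self) -> dict:
        # Deep copy so later mutations cannot leak into the snapshot.
        return copy.deepcopy(self._state)

    def restore(self, snap: dict) -> None:
        # Reset to a prior snapshot: every episode starts from the same
        # baseline, which is what makes unsafe or destructive agent
        # actions observable yet reversible.
        self._state = copy.deepcopy(snap)


# Harness loop: snapshot once, let the agent act, then restore.
svc = MockGmail()
baseline = svc.snapshot()
svc.send_email("alice@example.com", "Q3 report", "Draft attached.")  # agent action
svc.restore(baseline)  # next task replays from the identical clean state
assert svc.snapshot() == baseline
```

Under this design, the harness can also diff the post-episode state against the baseline snapshot to score task success and flag unsafe mutations, without ever touching a live service.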