ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Abstract

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky because their actions can cause irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates ranging from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.
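The deterministic snapshot/restore idea can be illustrated with a minimal sketch. It is not the ClawsBench implementation: `MockMailService`, its methods, and `run_isolated` are hypothetical names chosen for illustration, and the only assumption is that a mock service's state can be deep-copied before a task and put back afterwards, so that a destructive agent action never carries over into the next run.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MockMailService:
    """Hypothetical stand-in for one stateful mock service (not the ClawsBench API)."""
    messages: dict = field(default_factory=dict)

    def send(self, msg_id: str, body: str) -> None:
        self.messages[msg_id] = body

    def delete(self, msg_id: str) -> None:
        # A destructive action of the kind that is risky on a live service.
        self.messages.pop(msg_id, None)

    def snapshot(self) -> dict:
        # Deep copy so later mutations cannot leak into the saved state.
        return copy.deepcopy(self.messages)

    def restore(self, snap: dict) -> None:
        self.messages = copy.deepcopy(snap)


def run_isolated(service, task):
    """Run one task, then roll the service back to its pre-task state."""
    snap = service.snapshot()
    try:
        return task(service)
    finally:
        # Restore even if the task raises, so every run starts from the same state.
        service.restore(snap)
```

In this sketch, a task that deletes a message sees the deletion while it runs, but the service returns to its original contents once `run_isolated` finishes.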
