Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Abstract

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or to verify whether a task was actually executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
