WeaveBench: A Long-Horizon Real-World Benchmark for Hybrid-Interface Computer-Use Agents

Abstract

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separate capabilities, leaving long-horizon cross-interface orchestration under-tested. To address this gap, we introduce WeaveBench, a long-horizon hybrid-interface benchmark of 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires an agent to combine GUI observations and actions with CLI and code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, and detects shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing that the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed for measuring whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.