GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

Abstract

Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, in which screen-only GUI agents and skill-mediated CLI agents receive identical goals, initial states, and final-state verifiers and are strictly restricted to the actions native to their modality. In this controlled setting, the strongest GUI agent reaches a 59.1% full pass rate, outperforming the strongest original-skill CLI agent at 48.2%. Verifier-guided skill augmentation, however, raises CLI success to 69.3%, showing that much of the CLI deficit comes from incomplete skill coverage rather than from model capability alone. These results suggest that GUI and CLI expose different execution bottlenecks: GUI agents are limited by how reliably they can perform grounded interaction over long-horizon workflows, whereas CLI agents are limited by the coverage and scalability of their skill interfaces.
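The matched setup can be illustrated with a minimal Python sketch. It is not the benchmark's harness: the names `MatchedTask`, `full_pass_rate`, and `make_cli_agent`, the flat dictionary used as application state, and the two toy tasks are all assumptions made for illustration. The sketch shows only the idea that every agent is scored against the same goal, initial state, and final-state verifier, and that a CLI agent's pass rate can be capped by which skills it has.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical: application state modeled as a flat key/value snapshot.
State = Dict[str, str]
# Hypothetical: an agent maps (goal, initial state) to a final state.
Agent = Callable[[str, State], State]


@dataclass(frozen=True)
class MatchedTask:
    """One task, shared unchanged by every agent regardless of modality."""
    goal: str
    initial_state: State
    verifier: Callable[[State], bool]  # checks the final state only


def full_pass_rate(agent: Agent, tasks: List[MatchedTask]) -> float:
    """Fraction of tasks whose final-state verifier accepts the agent's result."""
    passed = 0
    for task in tasks:
        # Each run starts from a fresh copy of the same initial state.
        final_state = agent(task.goal, dict(task.initial_state))
        if task.verifier(final_state):
            passed += 1
    return passed / len(tasks)


def make_cli_agent(skills: Dict[str, Callable[[State], State]]) -> Agent:
    """Toy skill-mediated agent: it can only act through the skills it has."""
    def agent(goal: str, state: State) -> State:
        skill = skills.get(goal)
        # With no matching skill the state is left unchanged and the task fails.
        return skill(state) if skill else state
    return agent


# Two toy tasks (illustrative only, not taken from the benchmark).
TASKS = [
    MatchedTask("rename report", {"file": "draft.txt"},
                lambda s: s.get("file") == "report.txt"),
    MatchedTask("set title", {"title": ""},
                lambda s: s.get("title") == "Q3"),
]

# The original skill set covers one task; the augmented set adds the other.
ORIGINAL_SKILLS = {"rename report": lambda s: {**s, "file": "report.txt"}}
AUGMENTED_SKILLS = {**ORIGINAL_SKILLS,
                    "set title": lambda s: {**s, "title": "Q3"}}

original_rate = full_pass_rate(make_cli_agent(ORIGINAL_SKILLS), TASKS)
augmented_rate = full_pass_rate(make_cli_agent(AUGMENTED_SKILLS), TASKS)
```

In this toy setting the agent logic is identical in both runs and only the skill set changes, so any difference between `original_rate` and `augmented_rate` is due to skill coverage. The rates here come from two invented tasks and say nothing about the reported 48.2% and 69.3% figures.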