GUI vs. CLI: Execution Bottlenecks in Screen-Only and Skill-Mediated Computer-Use Agents

Abstract

Computer-use agents can execute software tasks through either graphical interfaces or programmatic command interfaces, but existing evaluations confound interaction modality with differences in tasks, initial states, verifiers, and permitted actions. We introduce a matched execution-layer benchmark of 440 desktop tasks across 18 applications and 12 workflow categories, where screen-only GUI agents and skill-mediated CLI agents receive identical goals, states, and final-state verifiers while being restricted to modality-native actions. In this controlled setting, the strongest GUI agent reaches a 59.1% full pass rate, outperforming the strongest original-skill CLI agent at 48.2%; however, verifier-guided skill augmentation raises CLI success to 69.3%, showing that much of the CLI deficit comes from incomplete skill coverage rather than model capability alone. These results suggest that GUI and CLI expose different execution bottlenecks: GUI agents are limited by the reliability of grounded interaction over long-horizon workflows, whereas CLI agents are limited by the coverage and scalability of their skill interfaces.