On the Reliability of Computer Use Agents

Abstract

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: (1) stochasticity during execution, (2) ambiguity in task specification, and (3) variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.
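To make the evaluation idea concrete, the following is a minimal sketch of how repeated executions and a paired, task-level comparison could be combined. The abstract does not name the specific test used, so the sign-flip permutation test here is an assumption chosen for illustration, and the function names, data layout, and example outcomes are all hypothetical rather than taken from the paper.

```python
import random
from statistics import mean


def success_rates(runs):
    """Map {task_id: [0/1 outcome per repeated run]} to {task_id: success rate}."""
    return {task: mean(outcomes) for task, outcomes in runs.items()}


def paired_sign_flip_test(rates_a, rates_b, n_permutations=10_000, seed=0):
    """Two-sided paired permutation test on per-task success-rate differences.

    Assumption: this is one possible paired test, not necessarily the one
    used in the paper. Tasks are paired across the two settings by task id.
    """
    tasks = sorted(set(rates_a) & set(rates_b))
    if not tasks:
        raise ValueError("the two settings share no tasks")
    diffs = [rates_b[t] - rates_a[t] for t in tasks]
    observed = mean(diffs)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical outcomes: 4 repeated runs per task under two settings.
    baseline = {"task_1": [1, 0, 1, 0], "task_2": [0, 0, 1, 0], "task_3": [1, 1, 1, 0]}
    clarified = {"task_1": [1, 1, 1, 1], "task_2": [1, 0, 1, 1], "task_3": [1, 1, 1, 1]}
    shift, p = paired_sign_flip_test(success_rates(baseline), success_rates(clarified))
    print(f"mean per-task change in success rate: {shift:+.2f}, p = {p:.3f}")
```

Pairing by task keeps each task's difficulty fixed across settings, so the test responds to task-level shifts rather than to differences in the task mix. Reporting a per-task success rate over repeated runs, instead of a single pass or fail, is what exposes tasks that succeed only some of the time.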

