On the Reliability of Computer Use Agents

Abstract

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.
