Video-Based Reward Modeling for Computer-Use Agents

Abstract

Computer-using agents (CUAs) are becoming increasingly capable, yet it remains difficult to scale the evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that predicts task success from only a user instruction and a video-execution sequence. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment across Ubuntu, macOS, Windows, and Android, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
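
To make the idea of spatiotemporal token pruning concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the pruning criteria, so everything here is an assumption for illustration: the function name `prune_tokens`, the use of grayscale frames, pixel variance as the test for homogeneous regions, mean absolute difference against the previous frame as the test for persistent tokens, and both threshold values.

```python
import numpy as np

def prune_tokens(frames, patch=16, var_thresh=1e-3, diff_thresh=1e-2):
    """Return a boolean keep-mask of shape (T, H // patch, W // patch).

    frames: float array of shape (T, H, W) with values in [0, 1];
    H and W are assumed to be divisible by `patch`.
    All criteria and thresholds are illustrative assumptions.
    """
    T, H, W = frames.shape
    gh, gw = H // patch, W // patch

    # Split every frame into non-overlapping patches: (T, gh, gw, patch * patch).
    patches = (
        frames.reshape(T, gh, patch, gw, patch)
        .transpose(0, 1, 3, 2, 4)
        .reshape(T, gh, gw, -1)
    )

    # Spatial pruning: drop patches whose pixels are nearly constant
    # (for example, blank backgrounds).
    textured = patches.var(axis=-1) > var_thresh

    # Temporal pruning: drop patches that barely differ from the same
    # location in the previous frame. The first frame has no predecessor,
    # so all of its patches count as changed.
    changed = np.ones((T, gh, gw), dtype=bool)
    delta = np.abs(patches[1:] - patches[:-1]).mean(axis=-1)
    changed[1:] = delta > diff_thresh

    # Keep a token only if it is both non-homogeneous and newly changed.
    return textured & changed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = np.zeros((3, 32, 32))
    video[:, :16, :16] = rng.random((16, 16))   # static textured region
    video[2, 16:, 16:] = rng.random((16, 16))   # UI change in the last frame
    keep = prune_tokens(video, patch=16)
    print(keep.sum(), "of", keep.size, "tokens kept")
```

In this toy video, the static textured patch is kept only in the first frame and the patch that changes in the last frame is kept there, which mirrors the stated goal of discarding redundant layout while preserving decisive UI changes.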
