Video-Based Reward Modeling for Computer-Use Agents

Abstract

Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
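To make the idea of spatiotemporal token pruning concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes grayscale frames given as nested lists, treats each non-overlapping patch as one token, and uses two hypothetical thresholds (`var_thresh`, `diff_thresh`): a patch is dropped if it is spatially homogeneous (low pixel variance) or temporally persistent (nearly unchanged from the same patch in the previous frame). The function names, patch size, and thresholds are all assumptions made for illustration.

```python
from statistics import pvariance


def split_patches(frame, patch):
    """Split a 2D grayscale frame (list of rows) into {(row, col): flat pixels}."""
    h, w = len(frame), len(frame[0])
    patches = {}
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            patches[(r // patch, c // patch)] = [
                frame[r + dr][c + dc]
                for dr in range(patch)
                for dc in range(patch)
            ]
    return patches


def prune_tokens(frames, patch=2, var_thresh=1.0, diff_thresh=1.0):
    """Return the kept tokens as (frame_index, patch_row, patch_col) tuples.

    Spatial rule (assumed): drop patches whose pixel variance <= var_thresh.
    Temporal rule (assumed): drop patches whose mean absolute difference from
    the same patch in the previous frame <= diff_thresh.
    """
    kept = []
    prev = None
    for t, frame in enumerate(frames):
        cur = split_patches(frame, patch)
        for pos, pixels in cur.items():
            if pvariance(pixels) <= var_thresh:
                continue  # homogeneous region, e.g. a flat background
            if prev is not None:
                diff = sum(abs(a - b) for a, b in zip(pixels, prev[pos])) / len(pixels)
                if diff <= diff_thresh:
                    continue  # persistent token, unchanged since last frame
            kept.append((t, pos[0], pos[1]))
        prev = cur
    return kept


# Toy example: a static textured widget (top-left) and a UI change that
# appears in the bottom-right patch of the second frame.
frame0 = [[0, 255, 10, 10], [255, 0, 10, 10], [10, 10, 10, 10], [10, 10, 10, 10]]
frame1 = [[0, 255, 10, 10], [255, 0, 10, 10], [10, 10, 0, 200], [10, 10, 200, 0]]
kept = prune_tokens([frame0, frame1])
print(kept)  # [(0, 0, 0), (1, 1, 1)]
```

In this toy case, the static widget is kept only in the first frame, the flat background is dropped everywhere, and the newly changed patch in the second frame survives. That is the behavior the abstract describes at a high level, though the actual pruning criteria in ExeVRM may differ.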