Video-Based Reward Modeling for Computer-Use Agents
March 10, 2026
Authors: Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Rahul Gupta, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao
cs.AI
Abstract
Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
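To make the spatiotemporal token pruning idea concrete, the following is a minimal toy sketch (not the paper's implementation): given a stack of keyframe token features, it drops tokens whose features stay (near-)constant across the whole video, keeping only tokens that carry UI changes. The function name `prune_static_tokens`, the variance criterion, and the `threshold` parameter are illustrative assumptions.

```python
import numpy as np

def prune_static_tokens(frames, threshold=1e-3):
    """Toy spatiotemporal token pruning (illustrative, not the paper's method).

    frames: array of shape (T, N, D) -- T keyframes, N tokens, D feature dims.
    Keeps tokens whose features vary over time (candidate decisive UI changes)
    and drops persistently static tokens.
    Returns (kept token indices, pruned frame stack of shape (T, K, D)).
    """
    # Per-token variance over time, averaged across feature dimensions.
    variance = frames.var(axis=0).mean(axis=-1)  # shape (N,)
    keep = np.flatnonzero(variance > threshold)
    return keep, frames[:, keep, :]

# Example: 4 keyframes, 6 tokens, 8-dim features; only token 2 changes.
rng = np.random.default_rng(0)
frames = np.tile(rng.normal(size=(1, 6, 8)), (4, 1, 1))   # all tokens static
frames[:, 2] += rng.normal(scale=0.5, size=(4, 8))        # token 2 varies
keep, pruned = prune_static_tokens(frames)
```

A real system would operate on vision-transformer patch tokens and also merge spatially homogeneous regions within each frame; this sketch only shows the temporal half of the criterion.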