Video-Based Reward Modeling for Computer-Use Agents
March 10, 2026
Authors: Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao
cs.AI
Abstract
Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent's internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video-task-reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
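To make the spatiotemporal token pruning idea concrete, here is a minimal illustrative sketch, not the paper's actual implementation: temporal pruning drops keyframes that barely differ from the last kept frame (persistent tokens), and spatial pruning keeps only patches with enough variance (decisive, localized UI changes) while discarding homogeneous regions. All function names, thresholds, and patch sizes are hypothetical.

```python
# Hypothetical sketch of spatiotemporal token pruning for execution videos.
# Frames are assumed to be 2D grayscale numpy arrays in [0, 1]; the names
# and thresholds below are illustrative, not from the paper.
import numpy as np

def prune_frames(frames, diff_thresh=0.05):
    """Temporal pruning: keep a frame only if it differs enough from the
    last kept frame (removes near-duplicate, 'persistent' keyframes)."""
    kept = [frames[0]]
    for f in frames[1:]:
        if np.abs(f - kept[-1]).mean() > diff_thresh:
            kept.append(f)
    return kept

def prune_patches(frame, patch=4, var_thresh=1e-4):
    """Spatial pruning: split a frame into patch x patch tokens and keep
    the indices of patches with enough variance, dropping homogeneous
    regions such as blank backgrounds."""
    h, w = frame.shape
    keep = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            if frame[i:i + patch, j:j + patch].var() > var_thresh:
                keep.append((i, j))
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = np.zeros((8, 8))
    changed = base.copy()
    changed[0:4, 0:4] = rng.random((4, 4))   # a small, localized UI change
    frames = [base, base.copy(), changed]    # middle frame is redundant
    kept = prune_frames(frames)
    print(len(kept))                         # redundant frame is dropped
    print(len(prune_patches(kept[-1])))      # only the changed patch survives
```

A real model would operate on vision-transformer tokens rather than raw pixel patches, but the selection logic is analogous: score each token's temporal and spatial informativeness, then feed only the surviving tokens to the reward model.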