STEVE: A Step Verification Pipeline for Computer-use Agent Training

Abstract

Developing AI agents that autonomously manipulate graphical user interfaces is a long-standing and challenging task. Recent advances in data scaling laws inspire us to train computer-use agents with a scaled instruction set, yet training agents with behavior cloning still requires an immense number of high-quality trajectories. To meet this scalability need, we design STEVE, a step verification pipeline for computer-use agent training. First, we establish a large instruction set for computer-use agents and collect trajectory data with suboptimal agents. Next, GPT-4o verifies the correctness of each step in the trajectories based on the screens before and after the action is executed, assigning each step a binary label. Finally, we adopt Kahneman-Tversky Optimization (KTO) to optimize the agent from the binary stepwise labels. Extensive experiments show that our agent outperforms supervised finetuning by leveraging both positive and negative actions within a trajectory. STEVE also enables us to train a 7B vision-language model as a computer-use agent, achieving leading performance in the challenging live desktop environment WinAgentArena with high efficiency and at reduced cost. Code and data: https://github.com/FanbinLu/STEVE.
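The pipeline described in the abstract (collect trajectories, label each step as correct or incorrect, then optimize on those binary labels) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the STEVE codebase: the `Step` and `LabeledStep` types, the `label_trajectory` helper, and the `toy_verifier` stub that stands in for the GPT-4o judge. The `kto_loss` function follows the commonly published per-example form of the KTO objective, with hypothetical default hyperparameters.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One agent step. Screens are plain strings standing in for screenshots."""
    screen_before: str
    action: str
    screen_after: str


@dataclass
class LabeledStep:
    """A KTO-style training example: prompt, completion, binary label."""
    prompt: str
    completion: str
    label: bool


def label_trajectory(
    instruction: str,
    steps: List[Step],
    verifier: Callable[[str, Step], bool],
) -> List[LabeledStep]:
    """Attach a binary correctness label to every step of a trajectory."""
    labeled = []
    for step in steps:
        is_correct = verifier(instruction, step)
        labeled.append(
            LabeledStep(
                prompt=f"{instruction}\n[screen] {step.screen_before}",
                completion=step.action,
                label=is_correct,
            )
        )
    return labeled


def toy_verifier(instruction: str, step: Step) -> bool:
    """Hypothetical stand-in for the GPT-4o judge: a step counts as
    correct only if the screen changed. A real verifier would compare
    the two screenshots against the instruction."""
    return step.screen_before != step.screen_after


def kto_loss(
    policy_logp: float,
    ref_logp: float,
    label: bool,
    kl_ref: float = 0.0,     # reference point z0 (assumed constant here)
    beta: float = 0.1,       # assumed hyperparameter values
    lambda_d: float = 1.0,
    lambda_u: float = 1.0,
) -> float:
    """Per-example KTO loss for a desirable (True) or undesirable (False) step."""
    reward = policy_logp - ref_logp  # implicit reward: log-probability ratio
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    if label:
        return lambda_d * (1.0 - sigmoid(beta * (reward - kl_ref)))
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - reward)))


trajectory = [
    Step("desktop", "click(Start)", "start_menu"),
    Step("start_menu", "click(Nowhere)", "start_menu"),
]
labeled = label_trajectory("Open the Start menu", trajectory, toy_verifier)
```

In this sketch, raising the policy's log-probability of a positively labeled action lowers the loss, while raising it for a negatively labeled action increases the loss. This is how both kinds of step can contribute a training signal, in contrast to supervised finetuning on positive steps only.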
