STEVE: A Step Verification Pipeline for Computer-use Agent Training

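Abstract

Developing AI agents that can autonomously operate graphical user interfaces is a long-standing and challenging task. Recent advances in data scaling laws inspire us to train computer-use agents with a scaled-up instruction set; however, training agents with behavior cloning still requires a large amount of high-quality trajectory data. To meet this scalability requirement, we designed STEVE, a step verification pipeline for computer-use agent training. First, we build a large instruction set for computer-use agents and collect trajectory data with several suboptimal agents. GPT-4o is then used to verify the correctness of each step in a trajectory based on the screens before and after the action is executed, assigning each step a binary label. Finally, we adopt Kahneman-Tversky Optimization (KTO) to optimize the agent from the binary step labels. Extensive experiments show that, by leveraging both the positive and negative actions in a trajectory, our agent outperforms supervised fine-tuning. In addition, STEVE enables us to train a 7B vision-language model as a computer-use agent that achieves leading performance in the challenging live desktop environment WinAgentArena while running efficiently at a lower cost. Code and data: https://github.com/FanbinLu/STEVE.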

English

Developing AI agents that autonomously manipulate graphical user interfaces is a long-standing and challenging task. Recent advances in data scaling laws inspire us to train computer-use agents with a scaled instruction set, yet training agents with behavior cloning still requires a large number of high-quality trajectories. To meet this scalability need, we designed STEVE, a step verification pipeline for computer-use agent training. First, we establish a large instruction set for computer-use agents and collect trajectory data with several suboptimal agents. Next, GPT-4o verifies the correctness of each step in the trajectories based on the screens before and after the action is executed, assigning each step a binary label. Finally, we adopt Kahneman-Tversky Optimization (KTO) to optimize the agent from the binary stepwise labels. Extensive experiments show that our agent outperforms supervised fine-tuning by leveraging both positive and negative actions within a trajectory. STEVE also enables us to train a 7B vision-language model as a computer-use agent, achieving leading performance in the challenging live desktop environment WinAgentArena with high efficiency and at a reduced cost. Code and data: https://github.com/FanbinLu/STEVE.
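
The pipeline described in the abstract (verify each step, assign a binary label, optimize with KTO) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the STEVE repository: the `Step` and `KTOExample` classes, the `label_trajectory` helper, and the `verify` callable (a stand-in for a GPT-4o judgment on the before and after screens) are hypothetical names. The per-example loss follows the published KTO formulation, with the reference KL term passed in as a precomputed number; it is not necessarily how the authors implement training.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One agent step in a trajectory (hypothetical schema)."""
    instruction: str    # task instruction given to the agent
    screen_before: str  # path to the screenshot before the action
    action: str         # action emitted by the agent
    screen_after: str   # path to the screenshot after the action


@dataclass
class KTOExample:
    """A prompt/completion pair with a binary desirability label."""
    prompt: str
    completion: str
    desirable: bool


# Hypothetical verifier: returns True when the step is judged correct.
# In the paper this role is played by GPT-4o; here it is any callable.
Verifier = Callable[[Step], bool]


def label_trajectory(steps: List[Step], verify: Verifier) -> List[KTOExample]:
    """Turn each step into one KTO example labeled by the verifier."""
    examples = []
    for step in steps:
        prompt = f"{step.instruction}\n<screen:{step.screen_before}>"
        examples.append(KTOExample(prompt, step.action, verify(step)))
    return examples


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(log_ratio: float, kl_ref: float, desirable: bool,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> float:
    """Per-example KTO loss.

    log_ratio is log(pi_theta(y|x) / pi_ref(y|x)); kl_ref is the
    reference-point KL estimate, assumed to be computed elsewhere.
    """
    if desirable:
        # Positive step: loss falls as the policy raises its likelihood.
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_ref)))
    # Negative step: loss falls as the policy lowers its likelihood.
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref - log_ratio)))
```

Because every step carries its own label, a trajectory that fails overall can still contribute its correct steps as positive examples and its incorrect steps as negative ones, which is the property the abstract credits for the gain over supervised fine-tuning.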


STEVE: A Step Verification Pipeline for Computer-use Agent Training
