Step-GUI Technical Report

December 17, 2025
Authors: Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin Chen, Wen Sun, Chengxu Yan, Chunqin Xu, Dong Li, Fengqiong Xiao, Guanghao Fan, Guopeng Li, Guozhen Peng, Hongbing Li, Hang Li, Hongming Chen, Jingjing Xie, Jianyong Li, Jingyang Zhang, Jiaju Ren, Jiayu Yuan, Jianpeng Yin, Kai Cao, Liang Zhao, Liguo Tan, Liying Shi, Mengqiang Ren, Min Xu, Manjiao Liu, Mao Luo, Mingxin Wan, Na Wang, Nan Wu, Ning Wang, Peiyao Ma, Qingzhou Zhang, Qiao Wang, Qinlin Zeng, Qiong Gao, Qiongyao Li, Shangwu Zhong, Shuli Gao, Shaofan Liu, Shisi Gao, Shuang Luo, Xingbin Liu, Xiaojia Liu, Xiaojie Hou, Xin Liu, Xuanti Feng, Xuedan Cai, Xuan Wen, Xianwei Zhu, Xin Liang, Xin Liu, Xin Zhou, Yingxiu Zhao, Yukang Shi, Yunfang Xu, Yuqing Zeng, Yixun Zhang, Zejia Weng, Zhonghao Yan, Zhiguo Huang, Zhuoyu Wang, Zheng Ge, Jing Li, Yibo Zhu, Binxing Jiao, Xiangyu Zhang, Daxin Jiang
cs.AI

Abstract

Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how can high-quality training data be acquired efficiently while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy at 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% on AndroidWorld, 48.5% on OSWorld, 62.6% on ScreenShot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation, with a hierarchical architecture that combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution in which sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3,146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
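
The two-level design described for GUI-MCP can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the GUI-MCP specification: the class names `LocalSpecialist` and `GuiToolServer`, the tool names `click`, `type_text`, and `execute_task`, and the return shapes are all hypothetical. It shows low-level atomic tools next to one high-level tool that delegates a whole task to an on-device model and returns only a status, so screen content never leaves the device.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LocalSpecialist:
    """Hypothetical stand-in for an on-device GUI model."""
    log: List[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        # A real model would inspect screenshots and act; this stub only records the task.
        self.log.append(task)
        return "done"


class GuiToolServer:
    """Toy two-level tool registry (illustrative only, all names are assumptions)."""

    def __init__(self, specialist: LocalSpecialist) -> None:
        self.specialist = specialist
        self.actions: List[str] = []
        self.tools: Dict[str, Callable[..., dict]] = {
            "click": self.click,                 # low-level atomic operation
            "type_text": self.type_text,         # low-level atomic operation
            "execute_task": self.execute_task,   # high-level task delegation
        }

    def click(self, x: int, y: int) -> dict:
        self.actions.append(f"click({x},{y})")
        return {"ok": True}

    def type_text(self, text: str) -> dict:
        # Record only the length, so the typed content is not echoed back to the caller.
        self.actions.append(f"type({len(text)} chars)")
        return {"ok": True}

    def execute_task(self, task: str) -> dict:
        # The local specialist handles the task; only a status string is returned.
        return {"status": self.specialist.run(task)}

    def call(self, name: str, **kwargs) -> dict:
        return self.tools[name](**kwargs)


server = GuiToolServer(LocalSpecialist())
server.call("click", x=120, y=480)
result = server.call("execute_task", task="open the settings app")
```

In this sketch a remote planner could choose between issuing atomic calls itself and handing a privacy-sensitive task to `execute_task`, which is one plausible reading of the hierarchical split the abstract describes.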