Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration

Abstract

The rapid growth in mobile device usage calls for better automation to manage tasks seamlessly. However, many AI-driven frameworks struggle because they lack sufficient operational knowledge. Manually written knowledge helps, but producing it is labor-intensive and inefficient. To address these challenges, we introduce Mobile-Agent-V, a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation. Mobile-Agent-V improves task execution by using video inputs directly, without requiring specialized sampling or preprocessing. It integrates a sliding window strategy and incorporates a video agent and a deep-reflection agent to ensure that actions align with user instructions. Through this approach, users can record task processes with guidance, enabling the system to learn autonomously and execute tasks efficiently. Experimental results show that Mobile-Agent-V achieves a 30% performance improvement over existing frameworks.
