Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration

Abstract

The rapid growth in mobile device usage calls for better automation to support seamless task management. However, many AI-driven frameworks struggle because they lack operational knowledge. Manually written knowledge helps, but it is labor-intensive and inefficient to produce. To address these challenges, we introduce Mobile-Agent-V, a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation. Mobile-Agent-V improves task execution by using video inputs directly, without requiring specialized sampling or preprocessing. It integrates a sliding window strategy and incorporates a video agent and a deep-reflection agent to ensure that actions align with user instructions. With this approach, users can record task processes with guidance, enabling the system to learn autonomously and execute tasks efficiently. Experimental results show that Mobile-Agent-V achieves a 30% performance improvement compared to existing frameworks.
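
The abstract names a sliding window strategy but does not specify it. As a purely illustrative sketch, one way to picture the idea is a fixed-size view over an ordered list of video frames that is repositioned as the task progresses. The class name `SlidingWindow`, the window size of 3, the string frame identifiers, and the clamping rule are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SlidingWindow:
    """Hypothetical fixed-size view over an ordered sequence of video frames."""

    frames: Sequence[str]  # frame identifiers, e.g. file names (assumed representation)
    size: int = 3          # assumed window length, not a value from the paper
    start: int = 0         # index of the first frame currently in view

    def current(self) -> List[str]:
        """Return the frames currently inside the window."""
        return list(self.frames[self.start:self.start + self.size])

    def move_to(self, new_start: int) -> None:
        """Reposition the window, clamping it to the valid index range."""
        last_start = max(0, len(self.frames) - self.size)
        self.start = min(max(new_start, 0), last_start)
```

In a setup like the one the abstract describes, a component such as the video agent might decide where the window should sit, and only the frames returned by `current()` would be shown to the component choosing the next action. That division of labor is an assumption here, not a description of the authors' implementation.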
