Lightweight Neural App Control
October 23, 2024
Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
cs.AI
Abstract
This paper introduces a novel mobile phone control architecture, termed "app
agents", for efficient interactions and controls across various Android apps.
The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a
textual goal and a sequence of past mobile observations, such as screenshots
and corresponding UI trees, to generate precise actions. To address the
computational constraints inherent to smartphones, within LiMAC, we introduce a
small Action Transformer (AcT) integrated with a fine-tuned vision-language
model (VLM) for real-time decision-making and task execution. We evaluate LiMAC
on two open-source mobile control datasets, demonstrating the superior
performance of our small-form-factor approach against fine-tuned versions of
open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly
outperforms prompt engineering baselines utilising closed-source foundation
models like GPT-4o. More specifically, LiMAC increases the overall action
accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to
prompt-engineering baselines.
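The abstract describes a two-tier decision pipeline: a small on-device Action Transformer (AcT) predicts the next action, and the heavier fine-tuned VLM is invoked only when free-form text must be generated. The following is a minimal illustrative sketch of that control flow; all function names, the `Observation` container, and the toy heuristics are hypothetical stand-ins, not the authors' actual code or models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    """One past mobile observation, per the abstract: a screenshot plus
    its corresponding UI tree (here a serialized placeholder)."""
    screenshot: bytes
    ui_tree: str

def act_predict(goal: str, history: List[Observation]) -> str:
    """Stand-in for AcT: a fast, lightweight prediction of the next
    action type. A toy keyword heuristic replaces the real transformer."""
    return "input-text" if "type" in goal.lower() else "click"

def vlm_generate(goal: str, obs: Observation) -> str:
    """Stand-in for the fine-tuned VLM, called only for actions that
    require generating free-form text."""
    return f"generated text for: {goal}"

def limac_step(goal: str, history: List[Observation]) -> dict:
    """One decision step: AcT picks the action type; the expensive VLM
    runs only when text is needed, keeping most steps cheap on-device."""
    action_type = act_predict(goal, history)
    if action_type == "input-text":
        return {"type": action_type, "text": vlm_generate(goal, history[-1])}
    return {"type": action_type}

obs = Observation(screenshot=b"", ui_tree="<root/>")
print(limac_step("Type hello in the search bar", [obs]))  # uses the VLM stub
print(limac_step("Open settings", [obs]))                 # AcT alone suffices
```

The point of the split, as the abstract frames it, is that most phone-control steps do not need a full VLM forward pass, which is how LiMAC addresses the computational constraints of smartphones.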