Lightweight Neural App Control
October 23, 2024
Authors: Filippos Christianos, Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
cs.AI
Abstract
This paper introduces a novel mobile phone control architecture, termed "app agents", for efficient interactions and controls across various Android apps. The proposed Lightweight Multi-modal App Control (LiMAC) takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, within LiMAC, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution. We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt-engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.
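
The abstract outlines a hybrid design: a small Action Transformer (AcT) handles routine on-device decision-making, with a fine-tuned VLM kept in the loop for task execution. Below is a minimal Python sketch of how one inference step of such an architecture might be wired up; the wrapper objects `act_model` and `vlm`, the action names, and the split between predicting an action type and deferring text generation to the VLM are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a LiMAC-style inference step: a lightweight action
# transformer picks the action type and target UI element, and a fine-tuned
# VLM is consulted only when open-ended text generation is required.
# Model wrappers and action names are assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes      # raw screen capture
    ui_tree: dict          # parsed UI hierarchy with element ids and bounds

def limac_step(goal: str, history: list[Observation], act_model, vlm) -> dict:
    obs = history[-1]
    # 1. The small Action Transformer predicts the action type
    #    (e.g. click, scroll, input-text) from the goal and past observations.
    action_type = act_model.predict_type(goal, history)

    if action_type == "input-text":
        # 2a. Text entry needs open-ended generation, so defer to the VLM.
        text = vlm.generate_text(goal, obs.screenshot, obs.ui_tree)
        return {"type": "input-text", "text": text}

    # 2b. For UI-grounded actions, the transformer selects the target element
    #     directly, avoiding a costly VLM call on the device.
    element = act_model.predict_target(goal, history, obs.ui_tree)
    return {"type": action_type, "target": element}
```

Keeping the VLM out of the common path is one plausible way to reconcile the real-time requirement with smartphone compute budgets, since most UI actions reduce to classification over a known set of elements rather than free-form generation.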