DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
June 14, 2024
Authors: Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
cs.AI
Abstract
Training corpuses for vision language models (VLMs) typically lack sufficient
amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal
for decision-making tasks such as in-the-wild device control through graphical
user interfaces (GUIs). While training with static demonstrations has shown
some promise, we show that such methods fall short for controlling real GUIs
due to their failure to deal with real-world stochasticity and non-stationarity
not captured in static observational data. This paper introduces a novel
autonomous RL approach, called DigiRL, for training in-the-wild device control
agents through fine-tuning a pre-trained VLM in two stages: offline RL to
initialize the model, followed by offline-to-online RL. To do this, we build a
scalable and parallelizable Android learning environment equipped with a
VLM-based evaluator and develop a simple yet effective RL approach for learning
in this domain. Our approach runs advantage-weighted RL with advantage
estimators enhanced to account for stochasticity along with an automatic
curriculum for deriving maximal learning signal. We demonstrate the
effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our
1.3B VLM trained with RL achieves a 49.5% absolute improvement -- from 17.7 to
67.2% success rate -- over supervised fine-tuning with static human
demonstration data. These results significantly surpass not only the prior best
agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent
trained with AitW data (38.5%), but also the prior best autonomous RL approach
based on filtered behavior cloning (57.8%), thereby establishing a new
state-of-the-art for digital agents for in-the-wild device control.
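To make the training recipe described above more concrete, the snippet below is a minimal sketch of a single advantage-weighted RL update in PyTorch. It is an illustration only, not the paper's implementation: `policy`, `value_fn`, and the batch layout are hypothetical stand-ins, and the advantage estimate here is a simple return-minus-baseline rather than DigiRL's enhanced estimators or automatic curriculum.

```python
import torch
import torch.nn.functional as F

def advantage_weighted_update(policy, value_fn, optimizer, batch,
                              beta=1.0, weight_clip=20.0):
    """One advantage-weighted policy update (AWR-style), for illustration.

    `policy` is assumed to expose log_prob(obs, actions) and `value_fn`
    to map observations to scalar state values; both are hypothetical
    modules used only to show the weighting scheme.
    """
    obs, actions, returns = batch["obs"], batch["actions"], batch["returns"]

    # Advantage estimate: observed return minus a learned value baseline.
    # The baseline reduces variance caused by environment stochasticity.
    with torch.no_grad():
        advantages = returns - value_fn(obs).squeeze(-1)

    # Exponentiated, clipped advantages act as per-step weights, so the
    # policy imitates high-advantage behavior more strongly.
    weights = torch.clamp(torch.exp(advantages / beta), max=weight_clip)

    log_probs = policy.log_prob(obs, actions)          # log pi(a | s)
    policy_loss = -(weights * log_probs).mean()

    # Regress the value baseline toward the observed returns.
    value_loss = F.mse_loss(value_fn(obs).squeeze(-1), returns)

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```

In an offline-to-online setup of the kind the abstract describes, an update like this would first be applied to a fixed dataset of logged trajectories and then continued on trajectories collected autonomously from the live environment.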