DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

June 14, 2024
Authors: Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar
cs.AI

Abstract

Training corpora for vision language models (VLMs) typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device-control agents by fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator, and we develop a simple yet effective RL approach for learning in this domain. Our approach runs advantage-weighted RL with advantage estimators enhanced to account for stochasticity, along with an automatic curriculum for deriving maximal learning signal. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement (from a 17.7% to a 67.2% success rate) over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (38.5%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state of the art for digital agents for in-the-wild device control.
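
The core update the abstract describes is advantage-weighted RL. To make that idea concrete, here is a minimal PyTorch sketch of a generic advantage-weighted (AWR-style) policy loss; the function name `advantage_weighted_loss` and the `beta` and `weight_clip` hyperparameters are illustrative assumptions, not DigiRL's implementation, which additionally uses stochasticity-aware advantage estimators and an automatic curriculum.

```python
# Minimal sketch of an advantage-weighted policy update, assuming a
# PyTorch policy that exposes per-step action log-probabilities.
# Hyperparameters `beta` and `weight_clip` are illustrative; the
# paper's stochasticity-aware advantage estimator and automatic
# curriculum are not reproduced here.
import torch


def advantage_weighted_loss(log_probs: torch.Tensor,
                            advantages: torch.Tensor,
                            beta: float = 1.0,
                            weight_clip: float = 20.0) -> torch.Tensor:
    """Imitate the rollout data, but up-weight actions whose estimated
    advantage is high and down-weight the rest."""
    # exp(A / beta), clipped for numerical stability; the weights are
    # treated as constants, so no gradient flows through them.
    weights = torch.exp(advantages / beta).clamp(max=weight_clip).detach()
    return -(weights * log_probs).mean()


# Toy usage with random tensors standing in for a rollout batch.
log_probs = torch.randn(32, requires_grad=True)
advantages = torch.randn(32)
loss = advantage_weighted_loss(log_probs, advantages)
loss.backward()
```

The exponential weighting interpolates between plain behavior cloning (large `beta`, near-uniform weights) and greedily favoring high-advantage actions (small `beta`), which is why AWR-style objectives are a common choice for the offline-to-online fine-tuning setting the abstract outlines.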

