Octopus: Embodied Vision-Language Programmer from Environmental Feedback
October 12, 2023
Authors: Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
cs.AI
Abstract
Large vision-language models (VLMs) have achieved substantial progress in
multimodal perception and reasoning. Seamlessly integrating such a model
into an embodied agent marks a crucial stride towards the creation of
autonomous and context-aware systems capable of formulating plans and executing
commands with precision. In this paper, we introduce Octopus, a novel VLM
designed to proficiently decipher an agent's vision and textual task objectives
and to formulate intricate action sequences and generate executable code. Our
design allows the agent to adeptly handle a wide spectrum of tasks, ranging
from mundane daily chores in simulators to sophisticated interactions in
complex video games. Octopus is trained by leveraging GPT-4 to control an
explorative agent to generate training data, i.e., action blueprints and the
corresponding executable code, within our experimental environment called
OctoVerse. We also collect environmental feedback, which enables an enhanced
training scheme, Reinforcement Learning with Environmental Feedback (RLEF).
Through a series of experiments, we illuminate Octopus's functionality and
present compelling results; the proposed RLEF proves to refine the agent's
decision-making. By open-sourcing our model architecture, simulator, and
dataset, we aspire to ignite further innovation and foster collaborative
applications within the broader embodied AI community.
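The abstract describes a loop in which the VLM turns a visual observation and a textual task into an action blueprint plus executable code, the environment runs that code, and the resulting feedback is logged for RLEF-style training. A minimal sketch of that loop, assuming entirely hypothetical names (`mock_vlm`, `run_episode` are invented for illustration and are not the paper's actual API):

```python
# Hypothetical sketch of the Octopus-style control loop: the model produces
# an action blueprint and executable code, the code is executed, and the
# environment's success signal is collected as feedback for later training.

def mock_vlm(observation: str, task: str) -> dict:
    """Stand-in for the VLM: maps vision + task text to a plan and code."""
    return {
        "blueprint": [f"locate target for task: {task}", "act on target"],
        "code": "result = 'done'",  # placeholder for generated executable code
    }

def run_episode(observation: str, task: str) -> dict:
    plan = mock_vlm(observation, task)
    scope: dict = {}
    exec(plan["code"], {}, scope)             # execute the generated code
    success = scope.get("result") == "done"   # environmental feedback signal
    return {"blueprint": plan["blueprint"], "success": success}

# Each episode's outcome would feed the RLEF training scheme.
feedback_log = [run_episode("kitchen scene", "wash the plate")]
```

In the actual system the environment (OctoVerse) supplies real execution feedback and GPT-4 plays the role of the explorative data-generating agent; this sketch only illustrates the data flow, not the model or simulator internals.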