Octopus: Embodied Vision-Language Programmer from Environmental Feedback
October 12, 2023
Authors: Jingkang Yang, Yuhao Dong, Shuai Liu, Bo Li, Ziyue Wang, Chencheng Jiang, Haoran Tan, Jiamu Kang, Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu
cs.AI
Abstract
Large vision-language models (VLMs) have achieved substantial progress in
multimodal perception and reasoning. Seamlessly integrating such a model
into an embodied agent marks a crucial stride towards the creation of
autonomous and context-aware systems capable of formulating plans and executing
commands with precision. In this paper, we introduce Octopus, a novel VLM
designed to proficiently decipher an agent's vision and textual task objectives
and to formulate intricate action sequences and generate executable code. Our
design allows the agent to adeptly handle a wide spectrum of tasks, ranging
from mundane daily chores in simulators to sophisticated interactions in
complex video games. Octopus is trained by leveraging GPT-4 to control an
explorative agent to generate training data, i.e., action blueprints and the
corresponding executable code, within our experimental environment called
OctoVerse. We also collect environmental feedback, which enables an enhanced
training scheme, Reinforcement Learning with Environmental Feedback (RLEF).
Through a series of experiments, we illuminate Octopus's functionality and
present compelling results; the proposed RLEF proves to refine the agent's
decision-making. By open-sourcing our model architecture, simulator, and
dataset, we aspire to ignite further innovation and foster collaborative
applications within the broader embodied AI community.
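The abstract describes a loop in which the VLM turns a visual observation and a textual task into an action blueprint plus executable code, the environment runs that code, and the resulting feedback is logged for RLEF-style training. A minimal sketch of that loop, assuming entirely hypothetical names (`mock_vlm`, `run_episode` are invented for illustration and are not the paper's actual API):

```python
# Hypothetical sketch of the Octopus-style control loop: the model produces
# an action blueprint and executable code, the code is executed, and the
# environment's success signal is collected as feedback for later training.

def mock_vlm(observation: str, task: str) -> dict:
    """Stand-in for the VLM: maps vision + task text to a plan and code."""
    return {
        "blueprint": [f"locate target for task: {task}", "act on target"],
        "code": "result = 'done'",  # placeholder for generated executable code
    }

def run_episode(observation: str, task: str) -> dict:
    plan = mock_vlm(observation, task)
    scope: dict = {}
    exec(plan["code"], {}, scope)             # execute the generated code
    success = scope.get("result") == "done"   # environmental feedback signal
    return {"blueprint": plan["blueprint"], "success": success}

# Each episode's outcome would feed the RLEF training scheme.
feedback_log = [run_episode("kitchen scene", "wash the plate")]
```

In the actual system the environment (OctoVerse) supplies real execution feedback and GPT-4 plays the role of the explorative data-generating agent; this sketch only illustrates the data flow, not the model or simulator internals.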