Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Abstract

Large vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. When seamlessly integrated into embodied agents, these models mark a crucial stride toward autonomous, context-aware systems capable of formulating plans and executing commands with precision. In this paper, we introduce Octopus, a novel VLM designed to decipher an agent's visual input and textual task objectives, formulate intricate action sequences, and generate executable code. Our design allows the agent to handle a wide spectrum of tasks, ranging from mundane daily chores in simulators to sophisticated interactions in complex video games. Octopus is trained on data generated by using GPT-4 to control an explorative agent within our experimental environment, OctoVerse; the training data consist of action blueprints and the corresponding executable code. We also collect environmental feedback that enables an enhanced training scheme, Reinforcement Learning with Environmental Feedback (RLEF). Through a series of experiments, we illuminate Octopus's functionality and present compelling results, showing that the proposed RLEF refines the agent's decision-making. By open-sourcing our model architecture, simulator, and dataset, we aspire to ignite further innovation and foster collaborative applications within the broader embodied AI community.
