JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models

Abstract

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks is potentially infinite, and they lack the ability to progressively improve task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to goal-conditioned controllers. We equip JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performance across over 200 varied tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 achieves a completion rate of 12.5% on the long-horizon diamond pickaxe task, an improvement of up to 5 times over previous records. Furthermore, we show that, thanks to its multimodal memory, JARVIS-1 is able to self-improve following a life-long learning paradigm, yielding more general intelligence and improved autonomy. The project page is available at https://craftjarvis-jarvis1.github.io.
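
The control flow described above (retrieve from memory, plan, dispatch sub-goals to a controller, store successful experience) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (MemoryEntry, MultimodalMemory, run_task, stub_planner, stub_controller) is hypothetical, the visual observation is reduced to a text description of the scene, retrieval uses simple word overlap instead of learned embeddings, and the planner and controller are stubs standing in for the multimodal language model and the goal-conditioned controller.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryEntry:
    """One stored experience: task, scene description, and the plan that succeeded."""
    task: str
    scene: str
    plan: List[str]


@dataclass
class MultimodalMemory:
    """Toy memory (assumption): ranks entries by word overlap with the query."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, entry: MemoryEntry) -> None:
        self.entries.append(entry)

    def retrieve(self, task: str, scene: str, k: int = 1) -> List[MemoryEntry]:
        query = set((task + " " + scene).lower().split())

        def overlap(entry: MemoryEntry) -> int:
            words = set((entry.task + " " + entry.scene).lower().split())
            return len(query & words)

        ranked = sorted(self.entries, key=overlap, reverse=True)
        # Keep only entries that share at least one word with the query.
        return [e for e in ranked[:k] if overlap(e) > 0]


# Hypothetical interfaces for the planner (language model) and the controller.
Planner = Callable[[str, str, List[MemoryEntry]], List[str]]
Controller = Callable[[str], bool]


def run_task(task: str, scene: str, planner: Planner,
             controller: Controller, memory: MultimodalMemory) -> bool:
    """Retrieve past experience, plan, execute sub-goals, and store successes."""
    examples = memory.retrieve(task, scene)
    plan = planner(task, scene, examples)
    # all() short-circuits, so execution stops at the first failed sub-goal.
    success = all(controller(goal) for goal in plan)
    if success:
        memory.add(MemoryEntry(task, scene, plan))
    return success


def stub_planner(task: str, scene: str, examples: List[MemoryEntry]) -> List[str]:
    """Stand-in planner: reuse a retrieved plan if any, else a fixed default."""
    if examples:
        return list(examples[0].plan)
    return ["collect logs", "craft planks", "craft " + task]


def stub_controller(goal: str) -> bool:
    """Stand-in controller: pretends every sub-goal succeeds."""
    return True
```

In this sketch, memory grows only when a whole plan succeeds, so later calls to run_task with a similar task and scene can reuse the stored plan. That mirrors, in a very reduced form, the self-improvement through memory that the abstract describes.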

