Learning to Model the World with Language

Abstract

To interact with humans in the world, agents need to understand the diverse types of language that people use, relate them to the visual world, and act based on them. While current agents learn to execute simple language instructions from task rewards, we aim to build agents that leverage diverse language that conveys general knowledge, describes the state of the world, provides interactive feedback, and more. Our key idea is that language helps agents predict the future: what will be observed, how the world will behave, and which situations will be rewarded. This perspective unifies language understanding with future prediction as a powerful self-supervised learning objective. We present Dynalang, an agent that learns a multimodal world model that predicts future text and image representations and learns to act from imagined model rollouts. Unlike traditional agents that use language only to predict actions, Dynalang acquires rich language understanding by also using past language to predict future language, video, and rewards. In addition to learning from online interaction in an environment, Dynalang can be pretrained on datasets of text, video, or both without actions or rewards. From using language hints in grid worlds to navigating photorealistic scans of homes, Dynalang uses diverse types of language to improve task performance, including environment descriptions, game rules, and instructions.
