Learning to Model the World with Language

Abstract

To interact with humans in the world, agents need to understand the diverse types of language that people use, relate them to the visual world, and act based on them. While current agents learn to execute simple language instructions from task rewards, we aim to build agents that leverage diverse language that conveys general knowledge, describes the state of the world, provides interactive feedback, and more. Our key idea is that language helps agents predict the future: what will be observed, how the world will behave, and which situations will be rewarded. This perspective unifies language understanding with future prediction as a powerful self-supervised learning objective. We present Dynalang, an agent that learns a multimodal world model that predicts future text and image representations and learns to act from imagined model rollouts. Unlike traditional agents that use language only to predict actions, Dynalang acquires rich language understanding by also using past language to predict future language, video, and rewards. In addition to learning from online interaction in an environment, Dynalang can be pretrained on datasets of text, video, or both without actions or rewards. From using language hints in grid worlds to navigating photorealistic scans of homes, Dynalang uses diverse types of language, including environment descriptions, game rules, and instructions, to improve task performance.
