Learning to Model the World with Language
July 31, 2023
Authors: Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan
cs.AI
Abstract
To interact with humans in the world, agents need to understand the diverse
types of language that people use, relate them to the visual world, and act
based on them. While current agents learn to execute simple language
instructions from task rewards, we aim to build agents that leverage diverse
language that conveys general knowledge, describes the state of the world,
provides interactive feedback, and more. Our key idea is that language helps
agents predict the future: what will be observed, how the world will behave,
and which situations will be rewarded. This perspective unifies language
understanding with future prediction as a powerful self-supervised learning
objective. We present Dynalang, an agent that learns a multimodal world model
that predicts future text and image representations and learns to act from
imagined model rollouts. Unlike traditional agents that use language only to
predict actions, Dynalang acquires rich language understanding by using past
language also to predict future language, video, and rewards. In addition to
learning from online interaction in an environment, Dynalang can be pretrained
on datasets of text, video, or both without actions or rewards. From using
language hints in grid worlds to navigating photorealistic scans of homes,
Dynalang utilizes diverse types of language to improve task performance,
including environment descriptions, game rules, and instructions.
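The core idea above — a recurrent latent state that fuses image and language inputs and is trained to predict the *next* step's text and image representations — can be illustrated with a minimal NumPy sketch. Everything here is a stand-in, not the paper's architecture: the dimensions are arbitrary, the "encoders" are fixed random projections rather than learned networks, and no gradient update is shown; the point is only the shape of the self-supervised prediction objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the paper).
IMG_DIM, TXT_DIM, LATENT = 16, 8, 32

# Fixed random projections standing in for learned encoder/predictor networks.
W_img = rng.normal(size=(IMG_DIM, LATENT)) / np.sqrt(IMG_DIM)
W_txt = rng.normal(size=(TXT_DIM, LATENT)) / np.sqrt(TXT_DIM)
W_rec = rng.normal(size=(2 * LATENT, LATENT)) / np.sqrt(2 * LATENT)
W_pred_img = rng.normal(size=(LATENT, LATENT)) / np.sqrt(LATENT)
W_pred_txt = rng.normal(size=(LATENT, LATENT)) / np.sqrt(LATENT)

def step(state, img, txt):
    """Fuse one image frame and one language token into the latent state."""
    fused = np.tanh(img @ W_img + txt @ W_txt)
    return np.tanh(np.concatenate([state, fused]) @ W_rec)

def prediction_loss(state, next_img, next_txt):
    """Self-supervised objective: from the current latent state, predict
    the NEXT step's image and text representations (squared error)."""
    err_img = state @ W_pred_img - np.tanh(next_img @ W_img)
    err_txt = state @ W_pred_txt - np.tanh(next_txt @ W_txt)
    return float(np.mean(err_img**2) + np.mean(err_txt**2))

# Roll over a short trajectory of paired (image, token) observations.
state = np.zeros(LATENT)
frames = rng.normal(size=(5, IMG_DIM))
tokens = rng.normal(size=(5, TXT_DIM))
losses = []
for t in range(4):
    state = step(state, frames[t], tokens[t])
    losses.append(prediction_loss(state, frames[t + 1], tokens[t + 1]))
print(f"mean next-step prediction loss: {np.mean(losses):.3f}")
```

In a trainable version, minimizing this loss over experience would shape the latent state to carry exactly the information language provides about what will be observed next — which is what unifies language understanding with future prediction in the abstract's framing.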