
Towards A Unified Agent with Foundation Models

July 18, 2023
作者: Norman Di Palo, Arunkumar Byravan, Leonard Hasenclever, Markus Wulfmeier, Nicolas Heess, Martin Riedmiller
cs.AI

Abstract

Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour, in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience data, scheduling skills, and learning from observations, which traditionally require separate, vertically designed algorithms. We test our method on a sparse-reward simulated robotic manipulation environment, where a robot needs to stack a set of objects. We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets, and illustrate how to reuse learned skills to solve novel tasks or imitate videos of human experts.
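The abstract's core idea — language as the agent's reasoning layer for exploration, skill scheduling, and data reuse — can be sketched in miniature. This is not the authors' implementation; `propose_subgoals` stands in for a language model decomposing a sparse-reward task, `goal_reached` for a vision-language model detecting sub-goal completion, and the relabeling step illustrates hindsight-style reuse of collected experience. All names and the stacking example are illustrative assumptions.

```python
def propose_subgoals(task: str) -> list[str]:
    """Hypothetical stub for an LLM decomposing a task into sub-goals."""
    if task == "stack red on blue":
        return ["pick up red", "place red on blue"]
    return [task]

def goal_reached(observation: dict, subgoal: str) -> bool:
    """Hypothetical stub for a VLM checking a sub-goal against an observation."""
    return subgoal in observation.get("achieved", [])

def run_episode(task: str, observations: list[dict]):
    """Follow LLM-proposed sub-goals and relabel each transition with the
    sub-goal it achieved, so the data can be reused beyond the original task."""
    subgoals = propose_subgoals(task)
    relabeled, i = [], 0
    for obs in observations:
        if i < len(subgoals) and goal_reached(obs, subgoals[i]):
            relabeled.append((obs, subgoals[i], 1.0))  # internal reward fires
            i += 1
        else:
            # no sub-goal completed this step: zero internal reward
            relabeled.append((obs, subgoals[min(i, len(subgoals) - 1)], 0.0))
    return subgoals, relabeled

subgoals, data = run_episode(
    "stack red on blue",
    [{"achieved": []},
     {"achieved": ["pick up red"]},
     {"achieved": ["pick up red", "place red on blue"]}],
)
```

The key design point the abstract emphasizes is that one mechanism (language-grounded sub-goals plus a language-conditioned success detector) replaces several vertically designed components: the sub-goals drive exploration, the detector supplies dense internal reward in a sparse-reward environment, and the relabeled tuples make the same trajectories reusable as offline data.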