LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
June 28, 2024
Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
cs.AI
Abstract
Large Language Models (LLMs) equipped with extensive world knowledge and
strong reasoning skills can tackle diverse tasks across domains, often by
posing them as conversation-style instruction-response pairs. In this paper, we
propose LLaRA: Large Language and Robotics Assistant, a framework which
formulates robot action policy as conversations, and provides improved
responses when trained with auxiliary data that complements policy learning.
LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity
to process state information as visual-textual prompts and generate optimal
policy decisions in text. To train such action policy VLMs, we first introduce
an automated pipeline to generate diverse high-quality robotics instruction
data from existing behavior cloning data. A VLM finetuned with the resulting
collection of datasets based on a conversation-style formulation tailored for
robotics tasks can generate meaningful robot action policy decisions. Our
experiments across multiple simulated and real-world environments demonstrate
the state-of-the-art performance of the proposed LLaRA framework. The code,
datasets, and pretrained models are available at
https://github.com/LostXine/LLaRA.
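To make the conversation-style formulation more concrete, below is a minimal sketch of how a single behavior cloning record might be rewritten as a visual-textual instruction-response pair for VLM finetuning. The field names, prompt template, and action encoding are illustrative assumptions, not the exact format produced by the LLaRA data pipeline; see the linked repository for the actual datasets and code.

```python
# Hypothetical sketch: wrapping one behavior-cloning step as a
# conversation-style instruction-response pair for VLM finetuning.
# Field names, the prompt template, and the normalized 2D action
# encoding are assumptions for illustration only.

import json
from typing import TypedDict


class BCRecord(TypedDict):
    image: str                    # path to the observation image
    instruction: str              # natural-language task description
    action: tuple[float, float]   # e.g., a normalized 2D target location


def to_conversation(record: BCRecord) -> dict:
    """Convert one behavior-cloning record into an instruction-response pair."""
    x, y = record["action"]
    prompt = (
        "<image>\n"
        f"Task: {record['instruction']}\n"
        "Where should the robot act next? Answer with normalized (x, y)."
    )
    response = f"The robot should act at ({x:.3f}, {y:.3f})."
    return {
        "image": record["image"],
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": response},
        ],
    }


if __name__ == "__main__":
    demo: BCRecord = {
        "image": "obs_0001.png",
        "instruction": "put the red block on the blue plate",
        "action": (0.42, 0.67),
    }
    print(json.dumps(to_conversation(demo), indent=2))
```

At inference time, the same template would be used in reverse: the VLM is prompted with the current observation image and task instruction, and the textual response is parsed back into an executable action.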