LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
June 28, 2024
Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
cs.AI
Abstract
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse, high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets, based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
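
To make the data-generation idea concrete, the sketch below shows how a single behavior-cloning sample might be templated into a conversation-style instruction-response pair for VLM finetuning. This is a minimal illustration under assumed conventions: the field names (`BCSample`, `end_effector_xy`), the prompt wording, the `<image>` placeholder, and the action text format are hypothetical and are not taken from the LLaRA repository; see the linked code for the actual pipeline.

```python
# Minimal sketch (not the actual LLaRA pipeline): convert one behavior-cloning
# sample into a conversation-style instruction-response pair for VLM finetuning.
# All field names and the action text format below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class BCSample:
    image_path: str                      # current observation (single RGB frame)
    task: str                            # natural-language task description
    end_effector_xy: tuple[float, float] # expert action: target position in [0, 1]
    rotation_deg: float                  # expert action: gripper rotation


def to_instruction_pair(sample: BCSample) -> dict:
    """Template a BC sample as an image-grounded instruction-response pair."""
    prompt = (
        "<image>\n"
        f"The task is: {sample.task}. "
        "What action should the robot take next? "
        "Answer with a target position and rotation."
    )
    x, y = sample.end_effector_xy
    response = (
        f"Move the gripper to ({x:.3f}, {y:.3f}) "
        f"with rotation {sample.rotation_deg:.1f} degrees."
    )
    return {
        "image": sample.image_path,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": response},
        ],
    }


if __name__ == "__main__":
    sample = BCSample("episode0/frame000.png", "pick up the red block", (0.42, 0.57), 30.0)
    print(to_instruction_pair(sample))
```

Expressing the expert action as plain text in the response is what lets a standard VLM finetuning recipe double as policy learning; at inference time the generated text would be parsed back into a robot command.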