LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
June 28, 2024
Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael S. Ryoo
cs.AI
Abstract
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse, high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets, based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
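
To make the data-generation idea concrete, the sketch below shows how a single behavior-cloning sample might be templated into a conversation-style instruction-response pair for VLM finetuning. This is a minimal illustration under assumed conventions: the field names (`BCSample`, `end_effector_xy`), the prompt wording, the `<image>` placeholder, and the action text format are hypothetical and are not taken from the LLaRA repository; see the linked code for the actual pipeline.

```python
# Minimal sketch (not the actual LLaRA pipeline): convert one behavior-cloning
# sample into a conversation-style instruction-response pair for VLM finetuning.
# All field names and the action text format below are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class BCSample:
    image_path: str                      # current observation (single RGB frame)
    task: str                            # natural-language task description
    end_effector_xy: tuple[float, float] # expert action: target position in [0, 1]
    rotation_deg: float                  # expert action: gripper rotation


def to_instruction_pair(sample: BCSample) -> dict:
    """Template a BC sample as an image-grounded instruction-response pair."""
    prompt = (
        "<image>\n"
        f"The task is: {sample.task}. "
        "What action should the robot take next? "
        "Answer with a target position and rotation."
    )
    x, y = sample.end_effector_xy
    response = (
        f"Move the gripper to ({x:.3f}, {y:.3f}) "
        f"with rotation {sample.rotation_deg:.1f} degrees."
    )
    return {
        "image": sample.image_path,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": response},
        ],
    }


if __name__ == "__main__":
    sample = BCSample("episode0/frame000.png", "pick up the red block", (0.42, 0.57), 30.0)
    print(to_instruction_pair(sample))
```

Expressing the expert action as plain text in the response is what lets a standard VLM finetuning recipe double as policy learning; at inference time the generated text would be parsed back into a robot command.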