DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Abstract

Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging: robotic platforms typically offer limited computation and memory capacity, whereas MLLM inference involves storing billions of parameters and performing tremendous computation, which imposes significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), peak computational consumption (i.e., latency), and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at https://github.com/yueyang130/DeeR-VLA.
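
To make the multi-exit idea concrete, the following is a minimal sketch of early-exit inference, not the authors' implementation. The abstract does not state the termination criterion, so this sketch assumes one: exit as soon as the actions predicted at two consecutive exits differ by less than a threshold. All names (`early_exit_forward`, `make_block`, `thresholds`) and the toy blocks are hypothetical and stand in for groups of MLLM layers. How thresholds would be chosen to meet a compute or memory budget is not covered here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def early_exit_forward(
    features: Vector,
    blocks: Sequence[Block],
    action_head: Callable[[Vector], Vector],
    thresholds: Sequence[float],
) -> Tuple[Vector, int]:
    """Run blocks in order and stop once consecutive exits agree.

    Assumed criterion (not taken from the abstract): exit after block i
    when its predicted action is within thresholds[i - 1] of the action
    predicted after block i - 1. Returns (action, number_of_blocks_used).
    """
    if not blocks:
        raise ValueError("at least one block is required")
    if len(thresholds) != len(blocks) - 1:
        raise ValueError("need one threshold per pair of adjacent exits")

    hidden = blocks[0](features)
    prev_action = action_head(hidden)
    for i in range(1, len(blocks)):
        hidden = blocks[i](hidden)
        action = action_head(hidden)
        if l2_distance(action, prev_action) < thresholds[i - 1]:
            return action, i + 1  # later blocks are never executed
        prev_action = action
    return prev_action, len(blocks)  # no early exit: full model used


def make_block(target: Vector) -> Block:
    """Toy block that moves the hidden state halfway toward `target`."""
    return lambda h: [x + 0.5 * (t - x) for x, t in zip(h, target)]


# Toy model: four blocks, identity action head.
toy_blocks = [make_block([1.0]) for _ in range(4)]
identity_head = lambda h: list(h)

action, used = early_exit_forward([0.0], toy_blocks, identity_head, [0.2] * 3)
print(action, used)
```

In this toy setting, a looser threshold ends the forward pass after fewer blocks and a threshold of zero always runs the full stack, which mirrors the trade-off between computation and prediction quality that the abstract describes.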
