VLANeXt: Recipes for Building Strong VLA Models

Abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. VLANeXt outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
