VLANeXt: Recipes for Building Strong VLA Models
February 20, 2026
Authors: Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy
cs.AI
Abstract
Following the rise of large foundation models, Vision-Language-Action models (VLAs) have emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet the current VLA landscape remains fragmented and exploratory: although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action-modeling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is VLANeXt, a simple yet effective model that outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.
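For context on the kind of baseline the abstract references: RT-2- and OpenVLA-style models cast continuous robot actions as discrete tokens so a vision-language model can predict them autoregressively. The sketch below illustrates that action tokenization interface in isolation; the bin count, action range, and function names are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

# Assumed setup (not from the paper): each continuous action dimension is
# discretized into one of N_BINS uniform bins over a normalized range, and
# the resulting bin indices are treated as ordinary vocabulary tokens.
N_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed normalized action bounds

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Map a continuous action vector to per-dimension bin indices (tokens)."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    frac = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    # Values exactly at the upper bound fall into the last bin.
    return np.minimum((frac * N_BINS).astype(int), N_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Map bin indices back to the continuous bin-center values."""
    frac = (tokens + 0.5) / N_BINS
    return ACTION_LOW + frac * (ACTION_HIGH - ACTION_LOW)

# Example: a 7-DoF end-effector action round-trips through the token space
# with at most half a bin width of quantization error.
a = np.array([0.1, -0.3, 0.0, 0.05, -0.9, 0.4, 1.0])
print(tokenize_action(a))
print(detokenize_action(tokenize_action(a)))
```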