
VLANeXt: Recipes for Building Strong VLA Models

February 20, 2026
Authors: Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy
cs.AI

Abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) have emerged, leveraging strong visual and language understanding for general-purpose policy learning. Yet the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2 and OpenVLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt, which outperforms prior state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong generalization in real-world experiments. We will release a unified, easy-to-use codebase that serves as a common platform for the community to reproduce our findings, explore the design space, and build new VLA variants on top of a shared foundation.