NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
April 28, 2025
Authors: Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U-Xuan Tan, Navonil Majumder, Soujanya Poria
cs.AI
Abstract
Existing Visual-Language-Action (VLA) models have shown promising performance
in zero-shot scenarios, demonstrating impressive task execution and reasoning
capabilities. However, a significant challenge arises from the limitations of
visual encoding, which can result in failures during tasks such as object
grasping. Moreover, these models typically suffer from high computational
overhead due to their large sizes, often exceeding 7B parameters. While these
models excel in reasoning and task planning, the substantial computational
overhead they incur makes them impractical for real-time robotic environments,
where speed and efficiency are paramount. To address the limitations of
existing VLA models, we propose NORA, a 3B-parameter model designed to reduce
computational overhead while maintaining strong task performance. NORA adopts
the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior
visual-semantic understanding to enhance visual reasoning and action grounding.
Additionally, our model is trained on 970k real-world robot demonstrations
and equipped with the FAST+ tokenizer for efficient action sequence generation.
Experimental results demonstrate that NORA outperforms existing large-scale VLA
models, achieving better task performance with significantly reduced
computational overhead, making it a more practical solution for real-time
robotic autonomy.
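Because NORA's language-model backbone emits discrete tokens, continuous robot actions must be mapped into the token vocabulary before they can be generated autoregressively. The FAST+ tokenizer does this by transforming an action chunk into frequency space with a discrete cosine transform, quantizing the coefficients, and compressing the result with byte-pair encoding. The following is a minimal sketch of that core idea only: it omits the BPE stage, and the quantization scale, vocabulary offset, and chunk shape are illustrative assumptions rather than NORA's actual settings.

```python
# Simplified sketch of FAST-style action tokenization: compress an action
# chunk with a DCT, quantize the coefficients, and flatten them into
# integer token ids that an autoregressive VLM can emit. The released
# FAST+ tokenizer additionally applies BPE over these ids; the constants
# below are illustrative assumptions, not values from the NORA paper.
import numpy as np
from scipy.fft import dct, idct

SCALE = 10.0          # quantization scale (assumed for this demo)
VOCAB_OFFSET = 1000   # shift ids into a reserved token range (assumed)

def encode_actions(chunk: np.ndarray) -> np.ndarray:
    """chunk: (horizon, action_dim) continuous actions -> 1D token ids."""
    coeffs = dct(chunk, axis=0, norm="ortho")         # frequency-space view
    quantized = np.round(coeffs * SCALE).astype(int)  # lossy quantization
    return quantized.flatten() + VOCAB_OFFSET

def decode_actions(tokens: np.ndarray, horizon: int, action_dim: int) -> np.ndarray:
    """Invert the mapping: token ids -> approximate action chunk."""
    coeffs = (tokens - VOCAB_OFFSET).reshape(horizon, action_dim) / SCALE
    return idct(coeffs, axis=0, norm="ortho")

# Round-trip demo on a random 8-step chunk of 7-DoF actions.
chunk = np.random.randn(8, 7) * 0.1
tokens = encode_actions(chunk)
recovered = decode_actions(tokens, 8, 7)
print("max reconstruction error:", np.abs(chunk - recovered).max())
```

The round trip is lossy only through quantization, which is the trade-off such tokenizers make: a short sequence of discrete tokens in exchange for a small, bounded reconstruction error in the decoded actions.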