NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Abstract

Existing Vision-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, limitations in visual encoding remain a significant challenge and can cause failures in tasks such as object grasping. Moreover, these models typically incur high computational overhead because of their large size, often exceeding 7B parameters. Although they excel at reasoning and task planning, this overhead makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address these limitations, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, NORA is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.
