NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

Abstract

Existing Visual-Language-Action (VLA) models have shown promising performance in zero-shot scenarios, demonstrating impressive task execution and reasoning capabilities. However, a significant challenge arises from the limitations of visual encoding, which can result in failures during tasks such as object grasping. Moreover, these models typically suffer from high computational overhead due to their large sizes, often exceeding 7B parameters. While these models excel in reasoning and task planning, the substantial computational overhead they incur makes them impractical for real-time robotic environments, where speed and efficiency are paramount. To address the limitations of existing VLA models, we propose NORA, a 3B-parameter model designed to reduce computational overhead while maintaining strong task performance. NORA adopts the Qwen-2.5-VL-3B multimodal model as its backbone, leveraging its superior visual-semantic understanding to enhance visual reasoning and action grounding. Additionally, our model is trained on 970k real-world robot demonstrations and equipped with the FAST+ tokenizer for efficient action sequence generation. Experimental results demonstrate that NORA outperforms existing large-scale VLA models, achieving better task performance with significantly reduced computational overhead, making it a more practical solution for real-time robotic autonomy.
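
As a rough illustration of the size gap the abstract describes, the sketch below estimates the memory needed just to hold model weights. It assumes the nominal parameter counts from the abstract (3B and 7B) and 2 bytes per parameter (a 16-bit format such as bf16); the function name `weight_memory_gb` is hypothetical, and the estimate ignores activations, KV cache, and the exact parameter counts of real checkpoints.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Nominal sizes taken from the abstract (assumption: 16-bit weights).
nora_gb = weight_memory_gb(3e9)      # 3B-parameter model
baseline_gb = weight_memory_gb(7e9)  # 7B-parameter VLA model

print(f"3B model: {nora_gb:.1f} GB, 7B model: {baseline_gb:.1f} GB")
```

Under these assumptions the weights alone take about 6 GB versus 14 GB, which is one reason a smaller model is easier to deploy on robot-side hardware. Actual inference cost also depends on factors this estimate leaves out.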
