VLA-R1: Enhancing Reasoning in Vision-Language-Action Models

Abstract

Vision-Language-Action (VLA) models aim to unify perception, language understanding, and action generation, offering strong cross-task and cross-scene generalization with broad impact on embodied AI. However, current VLA models often lack explicit step-by-step reasoning and emit final actions without considering affordance constraints or geometric relations. Their post-training pipelines also rarely reinforce reasoning quality, relying primarily on supervised fine-tuning with weak reward design. To address these challenges, we present VLA-R1, a reasoning-enhanced VLA that integrates Reinforcement Learning from Verifiable Rewards (RLVR) with Group Relative Policy Optimization (GRPO) to systematically optimize both reasoning and execution. Specifically, we design an RLVR-based post-training strategy with verifiable rewards for region alignment, trajectory consistency, and output formatting, thereby strengthening reasoning robustness and execution accuracy. We also develop VLA-CoT-13K, a high-quality dataset that provides chain-of-thought (CoT) supervision explicitly aligned with affordance and trajectory annotations. Extensive evaluations on in-domain, out-of-domain, simulation, and real-robot platforms demonstrate that VLA-R1 achieves superior generalization and real-world performance compared to prior VLA methods. We plan to release the model, code, and dataset following the publication of this work. Code: https://github.com/GigaAI-research/VLA-R1 Website: https://gigaai-research.github.io/VLA-R1
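As a rough illustration of the post-training idea named in the abstract, the sketch below scores a group of sampled responses with verifiable rewards and converts the scores into group-relative advantages, as in GRPO. Everything concrete here is an assumption made for illustration: the abstract does not say which metrics back the region and trajectory rewards, so plain box IoU stands in for region alignment, the `<think>`/`<answer>` template is a hypothetical output format, the trajectory reward is omitted, and all function names are invented for this example rather than taken from the authors' code.

```python
import re
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes.

    Stand-in for a region-alignment reward (assumption: the actual
    metric used by VLA-R1 is not stated in the abstract).
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def format_reward(text):
    """Return 1.0 if the text matches a hypothetical reasoning template."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = (0.0, 0.0, 2.0, 2.0)  # hypothetical annotated region
    samples = [  # hypothetical (response text, predicted box) pairs
        ("<think>grasp the handle</think><answer>box</answer>", (0.0, 0.0, 2.0, 2.0)),
        ("<think>grasp the rim</think><answer>box</answer>", (1.0, 1.0, 3.0, 3.0)),
        ("box", (1.0, 1.0, 3.0, 3.0)),
    ]
    rewards = [format_reward(text) + iou(box, target) for text, box in samples]
    print(group_relative_advantages(rewards))
```

Because each advantage is measured against the other samples drawn for the same prompt, no separate value model is needed; responses that score above their group's mean are reinforced and those below it are discouraged.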
