NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
November 18, 2025
Authors: Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, Soujanya Poria
cs.AI
Abstract
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or in real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by augmenting it with a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating that simple yet effective reward models can deliver significant gains in VLA reliability. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
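To make the architectural change concrete, the sketch below shows one common way a flow-matching action expert can generate continuous action chunks: a learned velocity field is integrated from Gaussian noise toward an action over a few Euler steps. This is not the authors' implementation; the `velocity_net` interface, chunk horizon, action dimension, and step count are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_action_chunk(velocity_net, vlm_features, action_dim=7, horizon=8, steps=10):
    """Integrate d a / d t = v_theta(a_t, t, context) from t=0 (noise) to t=1 (action).

    `velocity_net` is an assumed network that predicts the velocity field conditioned
    on VLM features; all shapes and hyperparameters here are illustrative.
    """
    a = torch.randn(1, horizon, action_dim)   # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)          # current integration time
        v = velocity_net(a, t, vlm_features)  # predicted velocity at (a_t, t)
        a = a + dt * v                        # forward Euler step toward the action
    return a                                  # denoised action chunk
```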
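The abstract describes combining the world-model reward with a deviation-from-ground-truth heuristic to rank sampled actions into preference pairs. The following is a minimal sketch of that pipeline under stated assumptions: the `world_model.goal_score` and `policy.sample_action` interfaces, the reward weights, and the dataset layout are hypothetical, not taken from the paper.

```python
import numpy as np

def deviation_reward(action, gt_action):
    """Heuristic reward: negative L2 deviation from the ground-truth action."""
    return -float(np.linalg.norm(np.asarray(action) - np.asarray(gt_action)))

def score_action(world_model, obs, instruction, action, gt_action,
                 wm_weight=1.0, dev_weight=1.0):
    """Combine the action-conditioned world-model reward with the deviation heuristic."""
    wm_reward = world_model.goal_score(obs, instruction, action)  # assumed WM interface
    return wm_weight * wm_reward + dev_weight * deviation_reward(action, gt_action)

def build_preference_pairs(policy, world_model, dataset, num_samples=4):
    """Sample candidate actions per state, rank them by combined reward, and keep
    the best/worst pair as a (chosen, rejected) preference example."""
    pairs = []
    for obs, instruction, gt_action in dataset:
        candidates = [policy.sample_action(obs, instruction) for _ in range(num_samples)]
        scored = sorted(
            candidates,
            key=lambda a: score_action(world_model, obs, instruction, a, gt_action),
        )
        pairs.append({
            "obs": obs,
            "instruction": instruction,
            "chosen": scored[-1],   # highest combined reward
            "rejected": scored[0],  # lowest combined reward
        })
    return pairs
```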
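Finally, the preference pairs are used to adapt the policy with direct preference optimization. Below is a minimal sketch of the standard DPO objective, assuming the policy and a frozen reference policy expose per-example action log-likelihoods; the batch values and the beta coefficient are dummies for illustration only.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: increase the policy's preference for 'chosen' over
    'rejected' actions relative to a frozen reference policy."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Example with dummy log-probabilities for a batch of 3 preference pairs.
pl_c = torch.tensor([-1.0, -0.5, -2.0])
pl_r = torch.tensor([-1.5, -1.0, -2.5])
rf_c = torch.tensor([-1.2, -0.8, -2.1])
rf_r = torch.tensor([-1.3, -0.9, -2.3])
print(dpo_loss(pl_c, pl_r, rf_c, rf_r))
```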