NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
November 18, 2025
Authors: Chia-Yu Hung, Navonil Majumder, Haoyuan Deng, Liu Renhang, Yankang Ang, Amir Zadeh, Chuan Li, Dorien Herremans, Ziwei Wang, Soujanya Poria
cs.AI
Abstract
Vision-language-action (VLA) models have recently shown promising performance on a variety of embodied tasks, yet they still fall short in reliability and generalization, especially when deployed across different embodiments or in real-world environments. In this work, we introduce NORA-1.5, a VLA model built from the pre-trained NORA backbone by augmenting it with a flow-matching-based action expert. This architectural enhancement alone yields substantial performance gains, enabling NORA-1.5 to outperform NORA and several state-of-the-art VLA models across both simulated and real-world benchmarks. To further improve robustness and task success, we develop a set of reward models for post-training VLA policies. Our rewards combine (i) an action-conditioned world model (WM) that evaluates whether generated actions lead toward the desired goal, and (ii) a deviation-from-ground-truth heuristic that distinguishes good actions from poor ones. Using these reward signals, we construct preference datasets and adapt NORA-1.5 to target embodiments through direct preference optimization (DPO). Extensive evaluations show that reward-driven post-training consistently improves performance in both simulation and real-robot settings, demonstrating that simple yet effective reward models can deliver significant gains in VLA reliability. Our findings highlight NORA-1.5 and reward-guided post-training as a viable path toward more dependable embodied agents suitable for real-world deployment.
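To make the post-training recipe concrete, below is a minimal, illustrative sketch of how the two reward signals described in the abstract could be combined to score sampled action chunks, build preference pairs, and fine-tune the policy with the standard DPO objective. The function names, the world-model scoring interface, the L2 deviation heuristic, and the beta value are assumptions made for illustration; this is not the released NORA-1.5 implementation.

```python
# Minimal sketch (not the authors' code): combine an action-conditioned world-model
# score with a deviation-from-ground-truth heuristic, rank sampled actions into
# preference pairs, and apply the standard DPO loss. All interfaces are hypothetical.

from dataclasses import dataclass
import torch
import torch.nn.functional as F


@dataclass
class PreferencePair:
    observation: torch.Tensor   # visual/language context fed to the policy
    chosen: torch.Tensor        # action chunk preferred by the reward model
    rejected: torch.Tensor      # action chunk dispreferred by the reward model


def deviation_reward(action: torch.Tensor, gt_action: torch.Tensor) -> float:
    """Heuristic reward: negative L2 deviation from the ground-truth action."""
    return -torch.norm(action - gt_action).item()


def score_action(world_model, obs, action, gt_action, wm_weight: float = 1.0) -> float:
    """Combine the world-model goal-progress score with the deviation heuristic.

    world_model(obs, action) is assumed to return a scalar indicating whether the
    action moves the scene toward the desired goal.
    """
    wm_score = float(world_model(obs, action))
    return wm_weight * wm_score + deviation_reward(action, gt_action)


def build_preference_pair(world_model, obs, sampled_actions, gt_action) -> PreferencePair:
    """Rank sampled action chunks by reward and keep the best/worst as a DPO pair."""
    scores = [score_action(world_model, obs, a, gt_action) for a in sampled_actions]
    best = sampled_actions[max(range(len(scores)), key=scores.__getitem__)]
    worst = sampled_actions[min(range(len(scores)), key=scores.__getitem__)]
    return PreferencePair(observation=obs, chosen=best, rejected=worst)


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on log-probabilities of chosen vs. rejected actions."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

In this sketch the policy and a frozen reference copy each score the chosen and rejected action chunks, and the loss pushes the policy to prefer the chunk ranked higher by the combined reward; the weighting between the world-model score and the deviation heuristic is a free design choice.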