机器人操控的世界价值模型

摘要

通用价值模型在从大规模、混合质量数据中扩展机器人策略学习方面发挥着关键作用。从数学角度看，精确的价值估计需要深层的时间理解能力，要求模型既能利用历史上下文确立当前信念，又能对未来结果进行规划。然而，现有大多数机器人价值模型基于视觉语言模型（VLM）主干构建，而这些VLM主要在静态或时间稀疏的视觉观察上预训练，缺乏价值估计所需的时间建模能力。与VLM不同，世界模型天然擅长时间建模和未来规划，使其成为学习可泛化价值函数的理想基础。受此启发，我们将世界模型与价值估计相结合，构建了一种新的通用机器人价值模型——世界价值模型（WVM），该模型能够提供精确的任务进展评估以衡量数据质量。在标准基准测试上，WVM在价值序相关性（VOC）指标上取得了最优结果。为补充仅包含专家数据的标准评估套件，我们进一步引入了次优价值基准（Suboptimal-Value-Bench），这是一个包含800条次优轨迹的多实体基准数据集，配有高保真度的人工标注帧级标签。评估表明，WVM在次优价值基准上仍保持最优性能，证明了其在处理专家数据和次优数据时的鲁棒性。在策略学习部署中，WVM在模拟环境和真实场景下均能提升多种策略提取方法的操作性能，为从混合质量数据中学习提供了稳健的指导。

English

Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding, requiring models to both ground the current belief using historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, lacking the requisite temporal modeling capabilities for value estimation. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we marry world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), that offers accurate task progressions to assess data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. Complementing standard evaluation suites that contains only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulated and real-world deployment, providing robust guidance for learning from mixed-quality data.