World Value Model for Robotic Manipulation

Abstract

Generalist value models play a pivotal role in scaling robotic policy learning from large-scale, mixed-quality data. Mathematically, accurate value estimation demands deep temporal understanding: a model must both ground its current belief in historical context and plan over future outcomes. However, most existing robotic value models are built on Vision-Language Model (VLM) backbones that are pretrained primarily on static or temporally sparse visual observations, and therefore lack the temporal modeling capabilities that value estimation requires. Unlike VLMs, world models naturally excel at temporal modeling and future planning, making them ideal foundations for learning generalizable value functions. Driven by this insight, we combine world models with value estimation to construct a new generalist robotic value model, World Value Model (WVM), that provides accurate task-progress estimates for assessing data quality. On standard benchmarks, WVM delivers state-of-the-art (SOTA) Value-Order Correlation (VOC) results. To complement standard evaluation suites, which contain only expert data, we further introduce Suboptimal-Value-Bench, a multi-embodiment benchmark consisting of 800 suboptimal trajectories with high-fidelity, human-labeled frame annotations. Our evaluations show that WVM maintains its SOTA performance on Suboptimal-Value-Bench, establishing its robustness in handling both expert and suboptimal data. When deployed for policy learning, WVM improves manipulation performance across various policy extraction approaches in both simulation and real-world deployment, providing robust guidance for learning from mixed-quality data.
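
The abstract reports Value-Order Correlation (VOC) without defining it. As a rough illustration only, the sketch below assumes the definition commonly used in prior work on value models: the Spearman rank correlation between a model's per-frame value predictions and the chronological order of the frames in an expert trajectory. The function names `average_ranks` and `value_order_correlation` are hypothetical, and the exact variant used to score WVM may differ.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group [start, end] over all entries tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def value_order_correlation(predicted_values: Sequence[float]) -> float:
    """Spearman correlation between predicted values and frame order.

    Assumption: frames are given in chronological order and the trajectory
    is an expert one, so task progress should increase over time.
    """
    n = len(predicted_values)
    if n < 2:
        raise ValueError("need at least two frames")
    value_ranks = average_ranks(predicted_values)
    time_ranks = [float(t + 1) for t in range(n)]
    mean_v = sum(value_ranks) / n
    mean_t = sum(time_ranks) / n
    cov = sum((v - mean_v) * (t - mean_t) for v, t in zip(value_ranks, time_ranks))
    var_v = sum((v - mean_v) ** 2 for v in value_ranks)
    var_t = sum((t - mean_t) ** 2 for t in time_ranks)
    if var_v == 0:
        # Constant predictions: correlation is undefined.
        # Assumption made here for illustration: report 0.
        return 0.0
    return cov / (var_v * var_t) ** 0.5
```

Under this assumed definition, a model whose values rise monotonically along an expert trajectory scores 1, and one whose values fall monotonically scores -1. On suboptimal trajectories frame order no longer tracks progress, so a score of this kind would have to be computed against labeled progress instead; the abstract does not specify how Suboptimal-Value-Bench is scored.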