dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Abstract

Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches, which motivates a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities, including vision, language, and robotic actions, into a unified token space and models them with a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, so that success can be determined automatically once the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, namely WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
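The evaluation loop implied by the abstract (roll a policy out inside the world model, keep a sparse set of keyframes, and stop once the predicted progress reaches 1) can be sketched roughly as follows. This is a toy illustration and not the authors' implementation: `KeyframeMemory`, `evaluate_in_world_model`, `toy_world_model`, the fixed-stride keyframe rule, and the integer "tokens" are all hypothetical stand-ins chosen only to make the control flow concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Obs = Tuple[int, ...]   # toy stand-in for a sequence of visual tokens
Action = int            # toy stand-in for an action token


@dataclass
class KeyframeMemory:
    """Hypothetical sparse memory: keeps every `stride`-th observation."""
    stride: int = 4
    frames: List[Obs] = field(default_factory=list)

    def maybe_store(self, step: int, obs: Obs) -> None:
        if step % self.stride == 0:
            self.frames.append(obs)


def evaluate_in_world_model(
    policy: Callable[[Obs], Action],
    world_model: Callable[[List[Obs], Obs, Action], Tuple[Obs, float]],
    first_obs: Obs,
    max_steps: int = 50,
    stride: int = 4,
) -> Tuple[bool, int]:
    """Roll the policy out inside the world model.

    Returns (success, steps_used). Success is declared as soon as the
    predicted progress value reaches 1.
    """
    memory = KeyframeMemory(stride)
    obs = first_obs
    for step in range(max_steps):
        memory.maybe_store(step, obs)
        action = policy(obs)
        # The model jointly predicts the next observation and the progress.
        obs, progress = world_model(memory.frames, obs, action)
        if progress >= 1.0:
            return True, step + 1
    return False, max_steps


def toy_world_model(keyframes: List[Obs], obs: Obs, action: Action) -> Tuple[Obs, float]:
    """Toy dynamics (assumption): action 1 advances the task by one fifth."""
    count = obs[0] + (1 if action == 1 else 0)
    return (count,), min(count / 5, 1.0)
```

In this toy setup, a policy that always emits action 1 is judged successful after five steps, while a policy that always emits action 0 exhausts `max_steps` and is judged a failure. A real system would replace `toy_world_model` with the learned denoising network and the integer tuples with actual token sequences.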
