dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

April 24, 2026
Authors: Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yaokai Xue, Yichen Zhu
cs.AI

Abstract

Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches, which motivates a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities, including vision, language, and robotic actions, into a unified token space and models them with a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, so that success can be determined automatically when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, namely WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
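The inference-time procedure described in the abstract (roll a policy forward inside the world model, predict a progress value alongside each future observation, and declare success once progress reaches 1) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `evaluate_in_world_model`, `toy_world_model_step`, and `RolloutResult`, the integer token stand-ins, and the toy dynamics are hypothetical and are not taken from the paper. The real model is a discrete diffusion transformer, which this stub does not attempt to reproduce.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real system would use visual and action tokens.
Observation = List[int]
Action = int


@dataclass
class RolloutResult:
    success: bool
    steps: int
    final_progress: float


def evaluate_in_world_model(
    policy: Callable[[Observation], Action],
    world_model_step: Callable[[Observation, Action, float], Tuple[Observation, float]],
    initial_obs: Observation,
    max_steps: int = 50,
    success_threshold: float = 1.0,
) -> RolloutResult:
    """Roll out a policy inside a learned simulator and read success off the progress value."""
    obs, progress = initial_obs, 0.0
    for step in range(1, max_steps + 1):
        action = policy(obs)
        # The world model jointly returns the next observation and a progress estimate.
        obs, progress = world_model_step(obs, action, progress)
        if progress >= success_threshold:
            return RolloutResult(True, step, progress)
    return RolloutResult(False, max_steps, progress)


def toy_world_model_step(
    obs: Observation, action: Action, progress: float
) -> Tuple[Observation, float]:
    # Toy dynamics (assumption): action 1 advances the task by 0.25, others do nothing.
    gain = 0.25 if action == 1 else 0.0
    return obs + [action], min(1.0, progress + gain)


# A policy that always takes the useful action, and one that never does.
good = evaluate_in_world_model(lambda obs: 1, toy_world_model_step, [0])
bad = evaluate_in_world_model(lambda obs: 0, toy_world_model_step, [0], max_steps=10)
print(good)
print(bad)
```

In this toy setup, the success rate of a policy over many initial observations would simply be the fraction of rollouts whose `success` field is true, which is the quantity a world-model evaluation proxy is meant to estimate without running the real robot.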