dWorldEval: Scalable Robot Policy Evaluation with a Discrete Diffusion World Model

Abstract

Evaluating robot policies across thousands of environments and tasks is infeasible with existing approaches, which motivates a new methodology for scalable policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robot policies. Specifically, dWorldEval maps all modalities, including vision, language, and robot actions, into a unified token space and models them with a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency and introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, so task success can be determined automatically when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, namely WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
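The inference-time procedure described above (roll the policy out inside the world model, read off the predicted progress, and declare success once it reaches 1) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the dWorldEval implementation: the names `evaluate_rollout`, `toy_world_model`, and `toy_policy`, the integer stand-ins for observation and action tokens, and the rule of storing every `keyframe_every`-th frame as a keyframe are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a real system would use token sequences, not ints.
Observation = int
Action = int
Policy = Callable[[Observation, str], Action]
WorldModel = Callable[[List[Observation], Observation, Action, str],
                      Tuple[Observation, float]]


def evaluate_rollout(policy: Policy,
                     world_model: WorldModel,
                     initial_obs: Observation,
                     instruction: str,
                     max_steps: int = 50,
                     keyframe_every: int = 2,
                     success_threshold: float = 1.0) -> Tuple[bool, int]:
    """Roll a policy out inside a world model; return (success, steps used)."""
    obs = initial_obs
    keyframes = [initial_obs]  # sparse memory of past frames (assumed scheme)
    for step in range(1, max_steps + 1):
        action = policy(obs, instruction)
        # The world model jointly predicts the next observation and progress.
        obs, progress = world_model(keyframes, obs, action, instruction)
        if progress >= success_threshold:
            return True, step
        if step % keyframe_every == 0:
            keyframes.append(obs)
    return False, max_steps


def toy_world_model(keyframes: List[Observation], obs: Observation,
                    action: Action, instruction: str
                    ) -> Tuple[Observation, float]:
    """Toy dynamics: the state is a counter and the task is done at 4."""
    next_obs = obs + action
    return next_obs, min(1.0, next_obs / 4)


def toy_policy(obs: Observation, instruction: str) -> Action:
    """Toy policy that always advances the counter by one."""
    return 1


if __name__ == "__main__":
    print(evaluate_rollout(toy_policy, toy_world_model, 0, "reach four"))
```

In this sketch the success check is a plain threshold on the predicted progress value, so no separate success classifier or human labeling is needed; a policy that never advances the toy state simply runs out of steps and is counted as a failure.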