dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Abstract

Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches, which motivates a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities, including vision, language, and robot actions, into a unified token space and models them with a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, so that success can be determined automatically when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, namely WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
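The inference procedure described in the abstract, in which a policy is rolled out inside the world model and success is declared once the predicted progress reaches 1, can be illustrated with a minimal sketch. This is not the authors' implementation. Every name here (`WorldModel`, `ToyWorldModel`, `evaluate_policy`, `keyframe_every`) is hypothetical, the toy model stands in for the discrete diffusion denoiser, and storing a keyframe at a fixed interval is an assumption, since the abstract does not say how keyframes are selected.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple

# Hypothetical token types: the abstract says only that all modalities
# share a unified token space, so plain integer lists are used here.
Observation = List[int]
Action = List[int]


class WorldModel(Protocol):
    """Hypothetical interface: predict the next observation and a progress value."""

    def step(
        self, memory: List[Observation], obs: Observation, action: Action
    ) -> Tuple[Observation, float]: ...


@dataclass
class RolloutResult:
    success: bool
    steps: int
    final_progress: float


def evaluate_policy(
    policy: Callable[[Observation], Action],
    world_model: WorldModel,
    initial_obs: Observation,
    max_steps: int = 50,
    keyframe_every: int = 5,  # assumption: fixed-interval keyframes
    success_threshold: float = 1.0,
) -> RolloutResult:
    """Roll a policy out inside the world model until progress reaches the threshold."""
    obs = initial_obs
    memory: List[Observation] = [initial_obs]  # sparse keyframe memory
    progress = 0.0
    for t in range(1, max_steps + 1):
        action = policy(obs)
        # The model jointly predicts the next observation and the progress value.
        obs, progress = world_model.step(memory, obs, action)
        if t % keyframe_every == 0:
            memory.append(obs)
        if progress >= success_threshold:
            return RolloutResult(True, t, progress)
    return RolloutResult(False, max_steps, progress)


class ToyWorldModel:
    """Toy stand-in for the learned model: action [1] advances progress by 0.25."""

    def __init__(self) -> None:
        self.progress = 0.0

    def step(
        self, memory: List[Observation], obs: Observation, action: Action
    ) -> Tuple[Observation, float]:
        if action == [1]:
            self.progress = min(1.0, self.progress + 0.25)
        next_obs = [obs[0] + 1]  # dummy "future observation"
        return next_obs, self.progress


if __name__ == "__main__":
    good = evaluate_policy(lambda obs: [1], ToyWorldModel(), [0])
    bad = evaluate_policy(lambda obs: [0], ToyWorldModel(), [0], max_steps=10)
    print(good)
    print(bad)
```

In this sketch the success criterion is read directly from the model's own progress prediction, so no separate success detector or human labeling is needed. That is the property the abstract attributes to the progress token.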
