

RobotArena∞: Scalable Robot Benchmarking via Real-to-Sim Translation

October 27, 2025
作者: Yash Jangir, Yidi Zhang, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki
cs.AI

Abstract

The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) policy evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
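To make the evaluation workflow described above concrete, the following is a minimal Python sketch of the loop it implies: translate each real-world demonstration into a simulated digital twin, perturb the twin along controlled axes (e.g., texture, object placement), roll out competing VLA policies, and aggregate automated VLM scores and pairwise human preferences. All names here (DigitalTwin, build_twin, vlm_score, etc.) are hypothetical placeholders used for illustration; this is not the paper's released code or API.

```python
# Illustrative sketch of a real-to-sim evaluation loop; every function below
# is a stub standing in for a component named in the abstract.
from dataclasses import dataclass
from typing import Callable, List
import random

@dataclass
class DigitalTwin:
    """Simulated counterpart of a real-world demonstration video."""
    scene_id: str
    task_instruction: str

@dataclass
class Rollout:
    twin: DigitalTwin
    frames: list          # rendered frames of the policy execution
    quality_hint: float   # coarse automatic signal used by the stubs below

def build_twin(demo_video_path: str) -> DigitalTwin:
    """Real-to-sim translation: in the paper this step uses VLMs, 2D-to-3D
    generative modeling, and differentiable rendering (stubbed here)."""
    return DigitalTwin(scene_id=demo_video_path, task_instruction="pick up the mug")

def perturb(twin: DigitalTwin, axis: str, seed: int) -> DigitalTwin:
    """Controlled variation along one axis, e.g. 'texture' or 'placement'."""
    return DigitalTwin(scene_id=f"{twin.scene_id}:{axis}:{seed}",
                       task_instruction=twin.task_instruction)

def rollout_policy(policy: Callable, twin: DigitalTwin) -> Rollout:
    """Execute a VLA policy in the simulated scene (stubbed)."""
    return Rollout(twin=twin, frames=[], quality_hint=random.random())

def vlm_score(rollout: Rollout) -> float:
    """Automated VLM-guided scoring of execution quality (stubbed)."""
    return rollout.quality_hint

def human_preference(rollout_a: Rollout, rollout_b: Rollout) -> int:
    """Lightweight crowdworker judgment: 0 if A is preferred, 1 if B (stubbed)."""
    return 0 if vlm_score(rollout_a) >= vlm_score(rollout_b) else 1

def evaluate(policy_a: Callable, policy_b: Callable,
             demo_videos: List[str], axes=("texture", "placement")) -> dict:
    wins = [0, 0]
    for video in demo_videos:
        base = build_twin(video)
        for axis in axes:
            for seed in range(3):                      # a few controlled variants per axis
                twin = perturb(base, axis, seed)
                ra = rollout_policy(policy_a, twin)
                rb = rollout_policy(policy_b, twin)
                wins[human_preference(ra, rb)] += 1    # pairwise preference vote
    return {"policy_a_wins": wins[0], "policy_b_wins": wins[1]}
```

The design point the sketch highlights is that the human's role shrinks to the `human_preference` comparison: scene construction, resetting, and perturbation all happen in simulation, so the benchmark can grow by adding demonstration videos and perturbation axes rather than adding lab time.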