RobotArena ∞: Scalable Robot Benchmarking via Real-to-Sim Translation

October 27, 2025
Authors: Yash Jangir, Yidi Zhang, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki
cs.AI

Abstract

The pursuit of robot generalists - instructable agents capable of performing diverse tasks across diverse environments - demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. Existing simulation benchmarks are similarly limited, as they train and test policies within the same synthetic domains and cannot assess models trained from real-world demonstrations or alternative simulation environments. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. In this paper, we introduce a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) policy evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated VLM-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, such as textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.
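
The abstract does not say how the crowdworkers' lightweight preference comparisons are aggregated into policy scores; the sketch below assumes a Bradley-Terry fit over pairwise preference counts, a common choice for arena-style benchmarks. The policy names, comparison counts, and the `bradley_terry_scores` helper are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def bradley_terry_scores(policies, comparisons, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.

    comparisons: dict mapping (winner, loser) policy names to the number of
    human judgments preferring `winner` over `loser`.
    Returns a dict of relative strengths normalized to sum to 1.
    """
    idx = {p: i for i, p in enumerate(policies)}
    n = len(policies)
    wins = np.zeros((n, n))
    for (winner, loser), count in comparisons.items():
        wins[idx[winner], idx[loser]] += count

    strength = np.ones(n)
    for _ in range(iters):  # standard minorization-maximization updates
        total = wins + wins.T  # comparisons played between each pair
        new = np.zeros(n)
        for i in range(n):
            denom = 0.0
            for j in range(n):
                if i == j or total[i, j] == 0:
                    continue
                denom += total[i, j] / (strength[i] + strength[j])
            w_i = wins[i].sum()
            new[i] = w_i / denom if denom > 0 else strength[i]
        strength = new / new.sum()
    return {p: strength[idx[p]] for p in policies}

# Usage with made-up counts: policy_a beat policy_b in 30 of 50 comparisons, etc.
scores = bradley_terry_scores(
    ["policy_a", "policy_b", "policy_c"],
    {("policy_a", "policy_b"): 30, ("policy_b", "policy_a"): 20,
     ("policy_a", "policy_c"): 40, ("policy_c", "policy_a"): 10,
     ("policy_b", "policy_c"): 35, ("policy_c", "policy_b"): 15},
)
```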
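
Likewise, the abstract describes perturbing the reconstructed digital twins along axes such as textures and object placements but gives no interface; the following is a minimal sketch of what such controlled perturbation could look like under an invented toy scene description (`Scene`, `SceneObject`, `perturb_scene`, and the texture pool are all hypothetical).

```python
import random
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical description of one object in a reconstructed digital-twin scene.
@dataclass(frozen=True)
class SceneObject:
    name: str
    position: tuple  # (x, y) placement on the table plane, in meters
    texture: str

# Hypothetical scene: a table texture plus the objects placed on it.
@dataclass(frozen=True)
class Scene:
    table_texture: str
    objects: tuple

TEXTURE_POOL = ["wood", "marble", "checker", "brushed_metal"]

def perturb_scene(scene: Scene, max_shift: float = 0.05,
                  rng: Optional[random.Random] = None) -> Scene:
    """Return a copy of the scene with textures resampled and object
    placements jittered within a controlled radius (max_shift meters)."""
    rng = rng or random.Random()
    new_objects = []
    for obj in scene.objects:
        dx = rng.uniform(-max_shift, max_shift)
        dy = rng.uniform(-max_shift, max_shift)
        new_objects.append(replace(
            obj,
            position=(obj.position[0] + dx, obj.position[1] + dy),
            texture=rng.choice(TEXTURE_POOL),
        ))
    return Scene(table_texture=rng.choice(TEXTURE_POOL),
                 objects=tuple(new_objects))

# Usage: generate a family of controlled variants of one reconstructed scene.
base = Scene("wood", (SceneObject("mug", (0.10, 0.20), "checker"),
                      SceneObject("plate", (0.30, 0.15), "marble")))
variants = [perturb_scene(base, rng=random.Random(seed)) for seed in range(10)]
```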