RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task-Generalist Policies

Abstract

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck because performance saturates quickly and true generalization goes untested. Existing benchmarks often exhibit significant domain overlap between training and evaluation, which inflates success rates and obscures insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, the framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which external factors most strongly affect that behavior under controlled perturbations. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a physically realistic and photorealistic simulation. Building on this, we propose the RoboLab-120 benchmark, consisting of 120 tasks categorized along three competency axes (visual, procedural, and relational) and three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantifies both their performance and the sensitivity of their behavior to controlled perturbations, indicating that high-fidelity simulation can serve as a proxy for analyzing performance and its dependence on external factors. Evaluation with RoboLab exposes a significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a standardized framework for evaluating the true generalization capabilities of task-generalist robotic policies.