EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

June 20, 2026
Authors: Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen
cs.AI

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π_0, π_{0.5}, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA shows strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal the strengths and weaknesses that an overall score conceals. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
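
To illustrate why a single success-rate scalar can hide differences between policies, the following is a minimal sketch of per-dimension aggregation. It is not EBench's actual code or API: the function names, the task names, the tags, and the two toy policies are all hypothetical, and the episode outcomes are made-up values chosen only to show the effect.

```python
from collections import defaultdict


def overall_success(episodes):
    """Fraction of successful episodes, i.e. the single scalar score."""
    return sum(int(success) for _, success in episodes) / len(episodes)


def capability_profile(episodes, task_tags):
    """Per-tag success rate; an episode counts toward every tag of its task."""
    wins, totals = defaultdict(int), defaultdict(int)
    for task, success in episodes:
        for tag in task_tags[task]:
            totals[tag] += 1
            wins[tag] += int(success)
    return {tag: wins[tag] / totals[tag] for tag in totals}


# Hypothetical annotation: each task is tagged with the capabilities it tests.
task_tags = {
    "open_drawer": ["dexterous"],
    "fetch_cup": ["mobile"],
}

# Hypothetical (task, success) episode logs for two toy policies.
policy_a = [("open_drawer", True), ("open_drawer", True),
            ("fetch_cup", False), ("fetch_cup", False)]
policy_b = [("open_drawer", False), ("open_drawer", False),
            ("fetch_cup", True), ("fetch_cup", True)]

for name, episodes in [("policy_a", policy_a), ("policy_b", policy_b)]:
    print(name, overall_success(episodes), capability_profile(episodes, task_tags))
```

In this toy setup both policies have the same overall success rate, yet their per-tag profiles are opposite, which is the kind of difference a per-dimension breakdown is meant to surface.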