EBench: A Foundational Diagnosis of Generalist Mobile Manipulation Policies

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π_0, π_{0.5}, XVLA, and InternVLA-A1, and show that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention; InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks; and XVLA is strong on a set of atomic skills disjoint from those on which the other policies are strong. Beyond capability profiling, EBench analyzes generalization from 4 representative perspectives, identifying the impact of different distribution-shift factors. These results reveal the strengths and weaknesses of each model that an overall score conceals. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
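To make the idea of profiling beyond a single scalar concrete, the following is a minimal sketch of how per-capability success rates and a train-test retention figure might be computed from episode logs. It is not EBench's actual tooling: the record layout, the tag names `mobile` and `dexterous`, the toy outcomes, and the definition of retention as test success rate divided by train success rate are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical episode records: (task, split, capability tags, success).
# Task names, tags, and outcomes are illustrative placeholders, not EBench data.
EPISODES = [
    ("task_a", "train", ("mobile",), True),
    ("task_a", "train", ("mobile",), True),
    ("task_b", "train", ("dexterous",), True),
    ("task_b", "train", ("dexterous",), False),
    ("task_a", "test", ("mobile",), True),
    ("task_a", "test", ("mobile",), False),
    ("task_b", "test", ("dexterous", "mobile"), False),
    ("task_b", "test", ("dexterous",), False),
]


def success_by_capability(episodes, split):
    """Return {capability tag: success rate} for one split.

    An episode tagged with several capabilities counts toward each of them.
    """
    counts = defaultdict(lambda: [0, 0])  # tag -> [successes, episodes]
    for _task, ep_split, tags, ok in episodes:
        if ep_split != split:
            continue
        for tag in tags:
            counts[tag][0] += int(ok)
            counts[tag][1] += 1
    return {tag: wins / total for tag, (wins, total) in counts.items()}


def overall_success(episodes, split):
    """Return the single success-rate scalar for one split."""
    outcomes = [ok for _task, ep_split, _tags, ok in episodes if ep_split == split]
    if not outcomes:
        raise ValueError(f"no episodes for split {split!r}")
    return sum(outcomes) / len(outcomes)


def retention(episodes):
    """Assumed definition: test success rate divided by train success rate."""
    train = overall_success(episodes, "train")
    if train == 0:
        raise ValueError("train success rate is zero; retention is undefined")
    return overall_success(episodes, "test") / train


if __name__ == "__main__":
    print("overall test success:", overall_success(EPISODES, "test"))
    print("per-capability test success:", success_by_capability(EPISODES, "test"))
    print("train-test retention:", retention(EPISODES))
```

In this toy log the overall test success rate is a single number, while the per-capability breakdown shows that every test success comes from episodes tagged `mobile` and none from those tagged `dexterous`, which is the kind of difference an aggregate score hides.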