EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π_0, π_{0.5}, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention; InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks; and XVLA shows strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal the strengths and weaknesses of each model that lie behind its overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.