EBench: Elementary Diagnosis of Generalist Mobile Manipulation Policies

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π_0, π_{0.5}, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles: π_{0.5} achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA shows strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization ability from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results expose the strengths and weaknesses that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
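
To illustrate why a single success-rate scalar can hide differences between policies, the following is a minimal sketch, not EBench's actual evaluation code. The function names, the capability tags ("mobile", "dexterous"), the two policies, and all episode outcomes are made up for illustration, and the retention formula (test success divided by train success) is one plausible definition assumed here, since the abstract does not define it.

```python
from collections import defaultdict


def overall_success(episodes):
    """Fraction of successful episodes (the single-scalar view)."""
    return sum(int(success) for _, success in episodes) / len(episodes)


def capability_profile(episodes):
    """Mean success rate per capability tag.

    episodes: list of (capability_tags, success) pairs, where
    capability_tags is a set of strings and success is a bool.
    An episode counts toward every tag it is annotated with.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for tags, success in episodes:
        for tag in tags:
            totals[tag] += 1
            wins[tag] += int(success)
    return {tag: wins[tag] / totals[tag] for tag in totals}


def retention(train_rate, test_rate):
    """Assumed definition: test success as a fraction of train success."""
    return test_rate / train_rate if train_rate > 0 else 0.0


# Hypothetical outcomes for two made-up policies (not EBench results).
policy_a = [({"mobile"}, True), ({"mobile"}, True),
            ({"dexterous"}, False), ({"dexterous"}, False)]
policy_b = [({"mobile"}, True), ({"mobile"}, False),
            ({"dexterous"}, True), ({"dexterous"}, False)]

for name, episodes in (("A", policy_a), ("B", policy_b)):
    print(name, overall_success(episodes), capability_profile(episodes))
```

In this toy data both policies have the same overall success rate, yet policy A succeeds only on the "mobile" episodes while policy B is evenly split, which is the kind of difference a per-dimension profile is meant to surface.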