EBench: Element-wise Diagnosis of Generalist Mobile Manipulation Policies

Abstract

We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models, including π_0, π_{0.5}, XVLA, and InternVLA-A1, and find that models with similar success rates exhibit strikingly different capability profiles. π_{0.5} achieves the highest test success rate and the best train-test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA shows strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization from 4 representative perspectives, identifying the impact of different distribution-shift factors. The results reveal the strengths and weaknesses that lie behind a model's overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
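To illustrate how two policies with the same overall success rate can differ in capability profile, the following is a minimal Python sketch. It is not EBench's evaluation code. The task names, capability tags, and success rates are made up for illustration, and `retention` assumes one plausible definition (test success divided by train success), which the abstract does not specify.

```python
from collections import defaultdict


def overall(results):
    """Mean success rate over all tasks (the single scalar)."""
    return sum(results.values()) / len(results)


def capability_profile(results, tags):
    """Mean success rate per capability tag.

    results: {task_name: success_rate in [0, 1]}
    tags:    {task_name: [capability_tag, ...]}
    """
    buckets = defaultdict(list)
    for task, rate in results.items():
        for tag in tags.get(task, []):
            buckets[tag].append(rate)
    return {tag: sum(rates) / len(rates) for tag, rates in buckets.items()}


def retention(train_rate, test_rate):
    """Assumed definition: test success as a fraction of train success."""
    if train_rate == 0:
        return 0.0
    return test_rate / train_rate


# Hypothetical tasks and tags; not the benchmark's actual annotation.
tags = {
    "task_a": ["mobile"],
    "task_b": ["dexterous"],
    "task_c": ["mobile", "dexterous"],
}

# Hypothetical per-task success rates for two made-up policies.
policy_x = {"task_a": 0.9, "task_b": 0.1, "task_c": 0.5}
policy_y = {"task_a": 0.5, "task_b": 0.5, "task_c": 0.5}

for name, results in [("policy_x", policy_x), ("policy_y", policy_y)]:
    print(name, round(overall(results), 3), capability_profile(results, tags))
```

In this toy setup both policies have the same overall score, yet `policy_x` is much stronger on the "mobile" tag than on "dexterous", while `policy_y` is flat across both. A single scalar would hide this kind of difference.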