RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding, each in four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but, after controlling for grasp success, select the semantically correct block at near-random or below-random rates. This reveals a persistent gap between backbone-level semantic competence and action prediction.
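
One way to read "controlling for grasp success" is to condition selection accuracy on the episodes in which the policy grasped some candidate block, and to compare the result against a uniform-guessing baseline of 1/K for a K-choice suite. The Python sketch below illustrates that reading only. The `Episode` record, its field names, and the helper functions are hypothetical and are not taken from the benchmark; the exact metric definition used by RSB is not stated here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record (not the benchmark's actual schema)."""
    num_choices: int              # e.g. 4 or 10 candidate blocks
    correct_index: int            # index of the block with the right answer
    grasped_index: Optional[int]  # block actually grasped, None if no grasp


def grasp_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which any candidate block was grasped."""
    return sum(e.grasped_index is not None for e in episodes) / len(episodes)


def conditional_accuracy(episodes: Sequence[Episode]) -> Optional[float]:
    """Selection accuracy restricted to episodes with a successful grasp.

    Returns None when no episode contains a grasp, since the conditional
    rate is undefined in that case.
    """
    grasped = [e for e in episodes if e.grasped_index is not None]
    if not grasped:
        return None
    correct = sum(e.grasped_index == e.correct_index for e in grasped)
    return correct / len(grasped)


def chance_level(num_choices: int) -> float:
    """Assumed baseline: uniform guessing among the candidate blocks."""
    return 1.0 / num_choices
```

Under these assumptions, a policy with a high `grasp_rate` but a `conditional_accuracy` close to `chance_level(4)` (0.25) or `chance_level(10)` (0.1) would be grasping reliably without using the instruction semantics to choose its target.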