RoboSemanticBench: Diagnosing Semantic Grounding in the Action Prediction of VLA Models

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
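The comparison in the last sentence (selection accuracy after controlling for grasp success, set against chance) can be sketched as follows. This is only an illustrative sketch, not the benchmark's actual evaluation code: the `Episode` record, its field names, and the `summarize` helper are hypothetical, and it assumes that "controlling for grasp success" means scoring only episodes in which some candidate block was grasped.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are assumptions."""
    num_choices: int               # e.g. 4 or 10 candidate answer blocks
    correct_index: int             # index of the block holding the correct answer
    grasped_index: Optional[int]   # index of the grasped block, None if no grasp


def summarize(episodes: Sequence[Episode]) -> dict:
    """Separate grasp success from semantic selection accuracy."""
    if not episodes:
        raise ValueError("need at least one episode")

    # Episodes in which the policy grasped some candidate block.
    grasped = [e for e in episodes if e.grasped_index is not None]
    grasp_rate = len(grasped) / len(episodes)

    if not grasped:
        return {"grasp_rate": grasp_rate, "selection_acc": None, "chance": None}

    # Selection accuracy conditioned on a successful grasp.
    correct = sum(e.grasped_index == e.correct_index for e in grasped)
    selection_acc = correct / len(grasped)

    # Chance level for uniform random picking: mean of 1/k over grasped episodes.
    chance = sum(1.0 / e.num_choices for e in grasped) / len(grasped)

    return {"grasp_rate": grasp_rate, "selection_acc": selection_acc, "chance": chance}
```

Under these assumptions, a policy with a high `grasp_rate` but a `selection_acc` close to `chance` (0.25 with four choices, 0.10 with ten) would match the pattern the abstract describes: it has learned to grasp blocks but not to choose the semantically correct one.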