RoboSemanticBench: Diagnosing Semantic Grounding in the Action Prediction of VLA Models

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding in four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but, once grasp success is controlled for, select the semantically correct block at near-random or below-random rates, revealing a persistent gap between backbone-level semantic competence and action prediction.
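
One way to read "controlling for grasp success" is to compute selection accuracy only over episodes in which some candidate block was actually grasped, and to compare it with the chance level of 1/4 or 1/10 for the respective suite. The sketch below illustrates that reading. The `Episode` record, its field names, and the helper functions are hypothetical and are not taken from the benchmark's own evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """Hypothetical per-episode record (assumed fields, not the RSB schema)."""
    num_choices: int               # 4 or 10 candidate answer blocks
    correct_index: int             # index of the block holding the right answer
    grasped_index: Optional[int]   # index of the grasped block, None if no grasp


def grasp_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which any candidate block was grasped."""
    if not episodes:
        raise ValueError("no episodes")
    grasped = sum(e.grasped_index is not None for e in episodes)
    return grasped / len(episodes)


def conditional_selection_accuracy(episodes: List[Episode]) -> Optional[float]:
    """Accuracy of the chosen block, restricted to episodes with a grasp.

    Returns None when no episode contains a successful grasp, since the
    conditional rate is undefined in that case.
    """
    grasped = [e for e in episodes if e.grasped_index is not None]
    if not grasped:
        return None
    correct = sum(e.grasped_index == e.correct_index for e in grasped)
    return correct / len(grasped)


def chance_level(num_choices: int) -> float:
    """Expected accuracy of a policy that picks a block uniformly at random."""
    return 1.0 / num_choices


# Toy data for illustration only; these are not results from the paper.
toy_episodes = [
    Episode(num_choices=4, correct_index=2, grasped_index=2),
    Episode(num_choices=4, correct_index=0, grasped_index=3),
    Episode(num_choices=4, correct_index=1, grasped_index=None),
    Episode(num_choices=4, correct_index=3, grasped_index=0),
]
toy_grasp_rate = grasp_rate(toy_episodes)
toy_accuracy = conditional_selection_accuracy(toy_episodes)
toy_chance = chance_level(4)
```

Separating the two quantities keeps a policy that grasps reliably but picks the wrong block from being confused with one that fails to grasp at all.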