RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning optimizes imitation of task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding, each in four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but, once grasp success is controlled for, select the semantically correct block at near-random or below-random rates, revealing a persistent gap between backbone-level semantic competence and action prediction.
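
To make "controlling for grasp success" concrete, the following is a minimal sketch of one way such a metric could be computed: selection accuracy restricted to episodes in which some candidate block was grasped, compared against the chance rate implied by the number of choices. The `Episode` record and the function names are hypothetical and introduced only for illustration; the exact metric definition used by RSB may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """Hypothetical per-episode record (not the benchmark's actual schema)."""
    num_choices: int              # 4 or 10 candidate answer blocks
    correct_block: int            # index of the block with the correct answer
    grasped_block: Optional[int]  # index of the grasped block, None if no grasp


def grasp_rate(episodes: List[Episode]) -> Optional[float]:
    """Fraction of episodes in which any candidate block was grasped."""
    if not episodes:
        return None
    grasped = sum(e.grasped_block is not None for e in episodes)
    return grasped / len(episodes)


def conditional_selection_accuracy(episodes: List[Episode]) -> Optional[float]:
    """Selection accuracy over episodes with a successful grasp only."""
    grasped = [e for e in episodes if e.grasped_block is not None]
    if not grasped:
        return None
    correct = sum(e.grasped_block == e.correct_block for e in grasped)
    return correct / len(grasped)


def chance_rate(episodes: List[Episode]) -> Optional[float]:
    """Expected accuracy of uniform random selection on grasped episodes."""
    grasped = [e for e in episodes if e.grasped_block is not None]
    if not grasped:
        return None
    return sum(1.0 / e.num_choices for e in grasped) / len(grasped)


# Toy four-choice data, invented purely to exercise the functions.
episodes = [
    Episode(num_choices=4, correct_block=0, grasped_block=0),
    Episode(num_choices=4, correct_block=0, grasped_block=1),
    Episode(num_choices=4, correct_block=2, grasped_block=None),
    Episode(num_choices=4, correct_block=3, grasped_block=3),
    Episode(num_choices=4, correct_block=1, grasped_block=2),
]

print(grasp_rate(episodes))
print(conditional_selection_accuracy(episodes))
print(chance_rate(episodes))
```

Separating the two quantities this way distinguishes a policy that fails to manipulate from one that manipulates reliably but picks blocks without regard to the question; only the latter would show a high grasp rate alongside conditional accuracy near the chance rate.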