RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models
June 1, 2026
Authors: Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen
cs.AI
Abstract
Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
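To make "after controlling for grasp success" concrete, the following is a minimal sketch of one way to separate grasping ability from semantic selection. It is an illustration, not the paper's evaluation code: the `Episode` record, its field names, and both functions are hypothetical, and it assumes that chance level for an N-choice suite is 1/N and that conditioning means restricting to episodes where some candidate block was grasped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """One evaluation episode (hypothetical record layout)."""
    num_choices: int         # 4 or 10 candidate answer blocks
    grasped: Optional[int]   # index of the grasped block, None if no grasp succeeded
    correct: int             # index of the block carrying the correct answer


def grasp_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which any candidate block was grasped."""
    return sum(e.grasped is not None for e in episodes) / len(episodes)


def selection_accuracy_given_grasp(episodes: List[Episode]) -> Optional[float]:
    """Among episodes with a successful grasp, the fraction that picked the correct block."""
    grasps = [e for e in episodes if e.grasped is not None]
    if not grasps:
        return None  # undefined when the policy never grasps anything
    return sum(e.grasped == e.correct for e in grasps) / len(grasps)


def chance_level(num_choices: int) -> float:
    """Expected accuracy of a policy that picks a block uniformly at random."""
    return 1.0 / num_choices


# Toy four-choice data, invented purely for illustration.
episodes = [
    Episode(num_choices=4, grasped=0, correct=0),
    Episode(num_choices=4, grasped=1, correct=2),
    Episode(num_choices=4, grasped=None, correct=1),
    Episode(num_choices=4, grasped=2, correct=3),
]
rate = grasp_rate(episodes)
accuracy = selection_accuracy_given_grasp(episodes)
baseline = chance_level(4)
```

Under these assumptions, a policy can show a high `rate` while `accuracy` stays close to `baseline`, which is the pattern the abstract reports: the policy has learned to grasp a block without using the question's semantics to choose which one.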