RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

June 1, 2026
Authors: Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen
cs.AI

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
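
To make "near-random after controlling for grasp success" concrete, here is a minimal sketch of one way such a metric could be computed. It is an illustration under stated assumptions, not the benchmark's actual implementation: the `Episode` record, its field names, and both helper functions are hypothetical, and the abstract does not specify how RSB defines or normalizes its scores. The sketch conditions selection accuracy on episodes where some candidate block was grasped and compares it with the uniform chance rate (1/4 for four-choice, 1/10 for ten-choice).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are assumptions."""
    num_choices: int              # 4 or 10 candidate answer blocks
    correct_index: int            # index of the semantically correct block
    grasped_index: Optional[int]  # index of the grasped block, None if no grasp


def grasp_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes in which any candidate block was grasped."""
    if not episodes:
        return 0.0
    grasped = sum(e.grasped_index is not None for e in episodes)
    return grasped / len(episodes)


def conditional_selection_accuracy(episodes: List[Episode]) -> Optional[float]:
    """Accuracy of block choice, restricted to episodes with a successful grasp.

    Returns None when no episode contains a grasp, since the conditional
    accuracy is undefined in that case.
    """
    grasped = [e for e in episodes if e.grasped_index is not None]
    if not grasped:
        return None
    correct = sum(e.grasped_index == e.correct_index for e in grasped)
    return correct / len(grasped)


def chance_rate(num_choices: int) -> float:
    """Accuracy expected from picking a block uniformly at random."""
    return 1.0 / num_choices


# Toy data (invented for illustration only): four four-choice episodes.
toy = [
    Episode(num_choices=4, correct_index=0, grasped_index=0),
    Episode(num_choices=4, correct_index=1, grasped_index=3),
    Episode(num_choices=4, correct_index=2, grasped_index=None),
    Episode(num_choices=4, correct_index=3, grasped_index=1),
]
```

On this toy data the policy grasps a block in three of four episodes but picks the correct one in only one of those three, so a high grasp rate can coexist with selection accuracy close to the 0.25 chance level. Separating the two quantities is what lets a semantic failure be distinguished from a manipulation failure.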