

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

March 30, 2026
Authors: Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung
cs.AI

Abstract

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings they are typically fine-tuned on limited data, which leads to overfitting to specific instruction formulations; their robustness to paraphrased instructions remains underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references to enable fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B parameters), we observe consistent performance degradation of 22-52 percentage points under paraphrasing. This degradation is driven primarily by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. The standard binary success-rate metric treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or succeed only on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para
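The abstract does not give PRIDE's exact formula, but the idea of scoring each paraphrase by semantic and syntactic difficulty and weighting success accordingly can be sketched. The sketch below is purely illustrative, not the paper's definition: it uses token-overlap (Jaccard) distance as a stand-in for a semantic factor and `difflib` sequence dissimilarity as a stand-in for a syntactic factor; the names `paraphrase_difficulty` and `weighted_success` and the mixing weight `alpha` are all hypothetical.

```python
from difflib import SequenceMatcher


def semantic_factor(original: str, paraphrase: str) -> float:
    """Token-overlap (Jaccard) distance as a crude semantic proxy.

    A real metric would use sentence embeddings; this keeps the sketch
    dependency-free.
    """
    a = set(original.lower().split())
    b = set(paraphrase.lower().split())
    return 1.0 - len(a & b) / len(a | b)


def syntactic_factor(original: str, paraphrase: str) -> float:
    """Character-level dissimilarity as a crude syntactic proxy."""
    return 1.0 - SequenceMatcher(None, original, paraphrase).ratio()


def paraphrase_difficulty(original: str, paraphrase: str, alpha: float = 0.5) -> float:
    """Blend semantic and syntactic factors into a single difficulty score in [0, 1]."""
    return (alpha * semantic_factor(original, paraphrase)
            + (1.0 - alpha) * syntactic_factor(original, paraphrase))


def weighted_success(results, alpha: float = 0.5) -> float:
    """Difficulty-weighted success rate over (original, paraphrase, succeeded) trials.

    Unlike a binary success rate, hard paraphrases contribute more weight,
    so a model that only handles near-identical rewordings scores lower.
    """
    total = sum(paraphrase_difficulty(o, p, alpha) for o, p, _ in results)
    if total == 0.0:  # all paraphrases identical to the original
        return 1.0 if all(s for _, _, s in results) else 0.0
    return sum(paraphrase_difficulty(o, p, alpha) for o, p, s in results if s) / total
```

For example, "pick up the cup" vs. "pick up the mug" (a single synonym swap) gets a small but nonzero difficulty, while an identical instruction gets zero, so a model failing only on heavily reworded instructions is penalized more than one failing on trivial ones.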