LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models
March 30, 2026
Authors: Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung
cs.AI
Abstract
Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned on limited data, which leads to overfitting to specific instruction formulations and leaves their robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references, enabling fine-grained analysis of linguistic generalization. Across seven VLA model configurations (0.6B-7.5B parameters), we observe consistent performance degradation of 22-52 percentage points under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level lexical matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. The standard binary success-rate metric treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or succeed only on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para
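To make the idea of scoring paraphrase difficulty concrete, the sketch below combines a semantic and a syntactic signal into a single score. This is purely illustrative: the function name, the weighting scheme, and the two proxy factors (token-overlap for semantics, character-level similarity for syntax) are assumptions for this example, not the actual PRIDE definition, which the paper derives from its own semantic and syntactic features.

```python
from difflib import SequenceMatcher


def paraphrase_difficulty(original: str, paraphrase: str,
                          w_sem: float = 0.5, w_syn: float = 0.5) -> float:
    """Toy paraphrase-difficulty score in [0, 1]; higher = harder.

    Semantic factor: 1 - Jaccard token overlap between the two
    instructions (a crude stand-in for an embedding-based distance).
    Syntactic factor: 1 - character-level similarity ratio.
    These proxies are illustrative only, NOT the factors used by PRIDE.
    """
    a = set(original.lower().split())
    b = set(paraphrase.lower().split())
    jaccard = len(a & b) / len(a | b) if (a | b) else 1.0
    semantic = 1.0 - jaccard
    syntactic = 1.0 - SequenceMatcher(None, original, paraphrase).ratio()
    return w_sem * semantic + w_syn * syntactic


# An identical instruction scores 0; a synonym substitution such as
# "mug" -> "cup" scores higher, mirroring the object-level lexical
# variation the benchmark isolates.
easy = paraphrase_difficulty("pick up the mug", "pick up the mug")
hard = paraphrase_difficulty("pick up the mug", "grasp the cup")
```

Under such a score, per-difficulty success rates could be reported instead of a single binary success rate, which is the gap the abstract attributes to existing evaluation.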