LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

October 15, 2025
Authors: Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, Xipeng Qiu
cs.AI

Abstract

Visual-Language-Action (VLA) models report impressive success rates on robotic manipulation benchmarks, yet these results may mask fundamental weaknesses in robustness. We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions: object layout, camera viewpoint, robot initial state, language instructions, lighting conditions, background textures, and sensor noise. We comprehensively analyze multiple state-of-the-art models and reveal consistent brittleness beneath apparent competence. Our analysis exposes critical weaknesses: models exhibit extreme sensitivity to perturbation factors such as camera viewpoint and robot initial state, with performance dropping from 95% to below 30% under modest perturbations. Surprisingly, models are largely insensitive to language variations, and further experiments reveal that they tend to ignore language instructions entirely. Our findings challenge the assumption that high benchmark scores equate to true competence and highlight the need for evaluation practices that assess reliability under realistic variation.
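The evaluation protocol described above — measuring success rates on a nominal benchmark and again under each perturbation dimension — can be sketched as a simple harness. This is a minimal illustration, not the paper's actual code: the seven dimension names, the `robustness_report` helper, and the `toy_eval` stand-in policy (which is hard-coded to be strong nominally but brittle to viewpoint shifts, mirroring the paper's qualitative finding) are all hypothetical.

```python
import random

# The seven perturbation dimensions analyzed in LIBERO-Plus.
PERTURBATION_DIMENSIONS = [
    "object_layout", "camera_viewpoint", "robot_initial_state",
    "language_instruction", "lighting", "background_texture", "sensor_noise",
]

def success_rate(outcomes):
    """Fraction of successful rollouts (True = success)."""
    return sum(outcomes) / len(outcomes)

def robustness_report(eval_fn, n_trials=100):
    """Evaluate a policy on the nominal benchmark and under each
    perturbation dimension, returning per-condition success rates.

    eval_fn(dimension, trial) stands in for a full simulator rollout
    and returns True on task success (dimension=None means nominal)."""
    report = {"nominal": success_rate(
        [eval_fn(None, t) for t in range(n_trials)])}
    for dim in PERTURBATION_DIMENSIONS:
        report[dim] = success_rate(
            [eval_fn(dim, t) for t in range(n_trials)])
    return report

def toy_eval(dimension, trial):
    """Toy stand-in policy: ~95% nominal success, collapsing to ~25%
    under camera-viewpoint perturbation, ~70% elsewhere (hypothetical
    numbers chosen to mimic the reported 95% -> below-30% drop)."""
    if dimension is None:
        seed, p = trial, 0.95
    else:
        idx = PERTURBATION_DIMENSIONS.index(dimension)
        seed = 1000 * (idx + 1) + trial
        p = 0.25 if dimension == "camera_viewpoint" else 0.70
    return random.Random(seed).random() < p

report = robustness_report(toy_eval)
```

A real harness would replace `toy_eval` with simulator rollouts of the VLA policy; the reporting logic stays the same, and the gap between `report["nominal"]` and the perturbed entries quantifies the brittleness the paper describes.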