VLSA: Vision-Language-Action Models with Plug-and-Play Safety Constraint Layer

December 9, 2025
Authors: Songqiao Hu, Zeyi Liu, Shuang Liu, Jun Cen, Zihan Meng, Xiao He
cs.AI

Abstract

Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in generalizing across diverse robotic manipulation tasks. However, deploying these models in unstructured environments remains challenging due to the critical need for simultaneous task compliance and safety assurance, particularly in preventing potential collisions during physical interactions. In this work, we introduce a Vision-Language-Safe Action (VLSA) architecture, named AEGIS, which contains a plug-and-play safety constraint (SC) layer formulated via control barrier functions. AEGIS integrates directly with existing VLA models to improve safety with theoretical guarantees, while maintaining their original instruction-following performance. To evaluate the efficacy of our architecture, we construct SafeLIBERO, a comprehensive safety-critical benchmark spanning distinct manipulation scenarios characterized by varying degrees of spatial complexity and obstacle intervention. Extensive experiments demonstrate the superiority of our method over state-of-the-art baselines. Notably, AEGIS achieves a 59.16% improvement in obstacle avoidance rate while substantially increasing the task execution success rate by 17.25%. To facilitate reproducibility and future research, we make our code, models, and benchmark datasets publicly available at https://vlsa-aegis.github.io/.
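
The abstract does not detail how the safety constraint (SC) layer is realized beyond noting that it is formulated via control barrier functions (CBFs). As a rough illustration only, the sketch below shows a standard CBF quadratic-program safety filter for a single spherical obstacle under a single-integrator model of the end-effector. The function name, the barrier choice, and all numerical values are assumptions made for this example and are not taken from the AEGIS implementation.

```python
import numpy as np

def cbf_safety_filter(u_nom, x_ee, x_obs, d_safe=0.05, alpha=5.0):
    """Minimally modify a nominal end-effector velocity so that a
    distance-based control barrier function (CBF) constraint holds.

    Barrier:     h(x) = ||x - x_obs||^2 - d_safe^2
    Dynamics:    x_dot = u                (single-integrator assumption)
    Constraint:  grad_h(x) . u + alpha * h(x) >= 0

    With one half-space constraint, the CBF quadratic program
    min ||u - u_nom||^2  s.t.  a.T u >= b  has a closed-form solution.
    """
    a = 2.0 * (x_ee - x_obs)                       # gradient of h at x_ee
    h = float(np.dot(x_ee - x_obs, x_ee - x_obs)) - d_safe ** 2
    b = -alpha * h                                 # right-hand side of a.T u >= b

    if np.dot(a, a) < 1e-9:                        # degenerate case: on the obstacle center
        return np.zeros_like(u_nom)
    if float(np.dot(a, u_nom)) >= b:
        return u_nom                               # nominal action is already safe
    # Project u_nom onto the half-space {u : a.T u >= b}
    return u_nom + (b - float(np.dot(a, u_nom))) / float(np.dot(a, a)) * a


# Hypothetical usage: filter the action proposed by a VLA policy before execution.
u_nom = np.array([0.10, 0.00, -0.05])              # nominal Cartesian velocity from the policy
x_ee = np.array([0.40, 0.00, 0.20])                # current end-effector position
x_obs = np.array([0.42, 0.00, 0.18])               # nearest obstacle point
u_safe = cbf_safety_filter(u_nom, x_ee, x_obs)
print(u_safe)
```

With a single half-space constraint the QP reduces to a closed-form projection, which keeps such a filter cheap enough to run inside a real-time control loop; handling multiple obstacles or richer dynamics would typically require an actual QP solver.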