Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Abstract

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, because they attach naturally to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable path from the VLA objective function back to object appearance, which makes end-to-end optimization difficult. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective over long horizons and across diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
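
As a rough illustration of the ideas named above, the following toy sketch composites a differentiable foreground (per-vertex colors interpolated to pixels) over fixed background frames and runs projected gradient ascent on a frame-weighted objective. It is not the Tex3D implementation: the linear `probe` standing in for a VLA loss gradient, the random interpolation weights `bary`, the hand-set `frame_weights`, and all function names are assumptions made for illustration, and the dual-renderer alignment step is not modeled.

```python
import numpy as np

def composite(vertex_colors, bary, mask, background):
    """Blend an interpolated foreground over a fixed background frame.

    vertex_colors: (V, 3) texture parameters in [0, 1]
    bary:          (P, V) per-pixel interpolation weights (hypothetical)
    mask:          (P, 1) foreground mask, 1 where the object is visible
    background:    (P, 3) frame from a non-differentiable simulator
    """
    foreground = bary @ vertex_colors  # (P, 3)
    return mask * foreground + (1.0 - mask) * background

def weighted_loss_and_grad(vertex_colors, frames, frame_weights, probe):
    """Toy objective: frame-weighted sum of a linear score over pixels.

    probe (P, 3) is an assumed stand-in for the gradient of a real
    VLA loss with respect to the image.
    """
    loss = 0.0
    grad = np.zeros_like(vertex_colors)
    for (bary, mask, background), w in zip(frames, frame_weights):
        image = composite(vertex_colors, bary, mask, background)
        loss += w * np.sum(probe * image)
        # Gradients reach the texture only through masked foreground pixels.
        grad += w * (bary.T @ (mask * probe))
    return loss, grad

def optimize_texture(vertex_colors, frames, frame_weights, probe,
                     steps=50, lr=0.1):
    """Projected gradient ascent that keeps colors inside [0, 1]."""
    colors = vertex_colors.copy()
    for _ in range(steps):
        _, grad = weighted_loss_and_grad(colors, frames, frame_weights, probe)
        colors = np.clip(colors + lr * grad, 0.0, 1.0)
    return colors

def make_toy_frames(num_frames=4, pixels=16, num_vertices=5, seed=1):
    """Random frames used only to exercise the sketch."""
    rng = np.random.default_rng(seed)
    frames = []
    for _ in range(num_frames):
        bary = rng.uniform(size=(pixels, num_vertices))
        bary /= bary.sum(axis=1, keepdims=True)
        mask = (rng.uniform(size=(pixels, 1)) > 0.5).astype(float)
        background = rng.uniform(size=(pixels, 3))
        frames.append((bary, mask, background))
    return frames

if __name__ == "__main__":
    frames = make_toy_frames()
    probe = np.random.default_rng(2).normal(size=(16, 3))
    frame_weights = np.array([0.1, 0.2, 0.3, 0.4])  # larger = more critical
    start = np.full((5, 3), 0.5)
    before, _ = weighted_loss_and_grad(start, frames, frame_weights, probe)
    tuned = optimize_texture(start, frames, frame_weights, probe)
    after, _ = weighted_loss_and_grad(tuned, frames, frame_weights, probe)
    print(f"objective before: {before:.3f}, after: {after:.3f}")
```

In this sketch the background pixels are never modified, which mirrors the stated goal of leaving the original simulation environment intact, and the per-frame weights are one simple way to emphasize some frames over others. A real attack would replace `probe` with gradients backpropagated from the VLA objective through a differentiable renderer.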