Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Abstract

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, because they attach naturally to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable path from the VLA objective function back to object appearance, which makes end-to-end optimization difficult. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective over long horizons and across diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
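The abstract does not specify how FBD or TAAO are implemented, so the following is only an illustrative sketch of the two ideas under stated assumptions. It assumes the object (foreground) comes from a differentiable renderer, the rest of the scene (background) comes from the unmodified simulator, and the two are blended with an object mask; it also assumes that "prioritizing behaviorally critical frames" can be approximated by a weighted average of per-frame losses. The function names composite and weighted_trajectory_loss, and the flat per-pixel list representation, are hypothetical and not taken from the paper.

```python
def composite(foreground, background, mask):
    """Blend two frames pixel by pixel (hypothetical helper, not the paper's FBD).

    mask = 1.0 keeps the differentiably rendered object pixel,
    mask = 0.0 keeps the simulator's original background pixel.
    All three arguments are flat lists of floats of equal length.
    """
    return [m * f + (1.0 - m) * b for f, b, m in zip(foreground, background, mask)]


def weighted_trajectory_loss(frame_losses, frame_weights):
    """Weighted mean of per-frame losses (assumed stand-in for frame prioritization).

    Frames with larger weights contribute more to the objective.
    """
    total = sum(frame_weights)
    return sum(w * l for w, l in zip(frame_weights, frame_losses)) / total


# Usage: the first pixel belongs to the object, the second to the background.
frame = composite([0.9, 0.9], [0.2, 0.2], [1.0, 0.0])

# Usage: the second frame is treated as three times as important as the first.
loss = weighted_trajectory_loss([1.0, 3.0], [1.0, 3.0])
```

In this sketch only the masked object pixels depend on the texture, so a gradient of the objective would reach the texture while the simulator's background remains untouched. The alignment between the two renderers that the abstract mentions is not modeled here.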
