Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models

Abstract

Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, which makes end-to-end optimization difficult. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon trajectories and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
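
The abstract gives no formulas for FBD or TAAO, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that decoupling can be pictured as mask-based compositing of a differentiably rendered object over a frame taken from the original simulator, and that prioritizing critical frames can be pictured as a weighted average of per-frame losses. The function names `composite`, `foreground_grad`, and `weighted_loss` are hypothetical, and images are flattened to short lists of pixel intensities for brevity.

```python
# Illustrative sketch only: all names and the compositing/weighting forms
# are assumptions, not taken from the paper.

def composite(foreground, background, mask):
    """Blend per pixel: mask * foreground + (1 - mask) * background.

    foreground: pixels from a (hypothetical) differentiable object renderer
    background: pixels from the original, non-differentiable simulator
    mask:       1.0 where the object is visible, 0.0 elsewhere
    """
    return [m * f + (1.0 - m) * b
            for f, b, m in zip(foreground, background, mask)]


def foreground_grad(mask, upstream_grad):
    """Gradient of the loss w.r.t. foreground pixels under the blend above.

    d(composite)/d(foreground) = mask, so only object pixels receive
    gradient and the simulator background is left untouched.
    """
    return [m * g for m, g in zip(mask, upstream_grad)]


def weighted_loss(frame_losses, frame_weights):
    """Weighted average of per-frame losses.

    A larger weight stands in for a more behaviorally critical frame.
    """
    total = sum(frame_weights)
    return sum(w * l for w, l in zip(frame_weights, frame_losses)) / total
```

Under this assumed form, the gradient of any loss on the composited frame reaches the texture only through masked object pixels, which is one way to picture keeping the original simulation environment intact while still optimizing object appearance.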
