Mechanistic interpretability for steering vision-language-action models
August 30, 2025
Authors: Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin
cs.AI
Abstract
Vision-Language-Action (VLA) models are a promising path to realizing
generalist embodied agents that can quickly adapt to new tasks, modalities, and
environments. However, methods for interpreting and steering VLAs fall far
short of classical robotics pipelines, which are grounded in explicit models of
kinematics, dynamics, and control. This lack of mechanistic insight is a
central challenge for deploying learned policies in real-world robotics, where
robustness and explainability are critical. Motivated by advances in
mechanistic interpretability for large language models, we introduce the first
framework for interpreting and steering VLAs via their internal
representations, enabling direct intervention in model behavior at inference
time. We project feedforward activations within transformer layers onto the
token embedding basis, identifying sparse semantic directions, such as speed
and direction, that are causally linked to action selection. Leveraging these
findings, we introduce a general-purpose activation steering method that
modulates behavior in real time, without fine-tuning, reward signals, or
environment interaction. We evaluate this method on two recent open-source
VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in
simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that
interpretable components of embodied VLAs can be systematically harnessed for
control, establishing a new paradigm for transparent and steerable foundation
models in robotics.
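The two-step idea in the abstract, projecting feedforward activations onto the token embedding basis and then steering along a discovered direction, can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the dimensions, the random stand-in matrices, and the `steer` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration; real VLAs (Pi0, OpenVLA)
# have far larger hidden sizes and vocabularies.
d_model, vocab_size = 64, 1000

# Stand-in for the transformer's token (un)embedding matrix.
W_embed = rng.standard_normal((vocab_size, d_model))

# Stand-in for a feedforward (MLP) activation vector captured from
# one transformer layer during a forward pass.
h = rng.standard_normal(d_model)

# Step 1: project the activation onto the token embedding basis.
# Each score measures how strongly the activation points along one
# token's embedding direction; the claim in the paper is that a
# sparse set of semantically meaningful directions dominates.
scores = W_embed @ h
top_k = np.argsort(scores)[-5:][::-1]  # 5 most-aligned token directions


# Step 2: activation steering - nudge the activation along a chosen
# semantic direction at inference time (no fine-tuning, no reward
# signal, no environment interaction).
def steer(activation: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a unit-norm semantic direction, scaled by alpha."""
    unit = direction / np.linalg.norm(direction)
    return activation + alpha * unit


# Positive alpha strengthens the behavior tied to this direction;
# negative alpha would suppress it.
h_steered = steer(h, W_embed[top_k[0]], alpha=4.0)
```

In the paper's setting the steered activation would be written back into the layer's residual stream mid-inference; here the effect is only that `h_steered` has a larger component along the chosen embedding direction than `h` does.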