

Mechanistic interpretability for steering vision-language-action models

August 30, 2025
Authors: Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin
cs.AI

Abstract

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions - such as speed and direction - that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control - establishing a new paradigm for transparent and steerable foundation models in robotics.
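The two core operations the abstract describes can be illustrated in miniature: projecting an internal activation onto the token embedding basis (to read off semantic directions), and activation steering (adding a scaled direction to an activation at inference time). The sketch below is illustrative only, not the paper's implementation; the embedding matrix, dimensions, and the "speed" direction are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 8, 16
W_E = rng.normal(size=(vocab, d_model))  # hypothetical token embedding matrix

def project_to_vocab(activation, embed):
    """Score each vocabulary token by its dot product with the activation,
    i.e., interpret the activation in the token embedding basis."""
    return embed @ activation

def steer(activation, direction, alpha):
    """Activation steering: add a scaled semantic direction at inference time,
    with no fine-tuning or reward signal."""
    return activation + alpha * direction

# A hypothetical "speed" direction, taken as one token's normalized embedding.
speed_token = 3
speed_dir = W_E[speed_token] / np.linalg.norm(W_E[speed_token])

h = rng.normal(size=d_model)              # a stand-in feedforward activation
h_steered = steer(h, speed_dir, alpha=4.0)

logits_before = project_to_vocab(h, W_E)
logits_after = project_to_vocab(h_steered, W_E)

# Adding the direction raises the score of its associated token.
assert logits_after[speed_token] > logits_before[speed_token]
```

In the paper's setting, the steered quantity is a transformer feedforward activation inside a VLA policy, and the direction is one found to be causally linked to action properties such as speed; this toy version only shows the algebra of the intervention.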