Mechanistic interpretability for steering vision-language-action models

Abstract

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis and identify sparse semantic directions, such as speed and direction, that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control, establishing a new paradigm for transparent and steerable foundation models in robotics.
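The two steps named in the abstract, projecting feedforward weights onto the token embedding basis and then steering the selected activations at inference time, can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the four-token vocabulary, the identity embedding matrix, the ReLU feedforward block, the function names, and the choice to clamp selected neurons to a constant `alpha`. It is not the procedure used for Pi0 or OpenVLA, whose details are not given in this abstract.

```python
import numpy as np

def top_tokens_per_neuron(W_out, E, k):
    """Project each FFN value vector (a row of W_out) onto the token
    embedding basis and return the k highest-scoring token ids per neuron."""
    scores = W_out @ E.T                      # shape (d_ff, vocab)
    return np.argsort(-scores, axis=1)[:, :k]

def neurons_for_tokens(top_tokens, token_ids):
    """Indices of neurons whose top-k tokens include any of token_ids."""
    mask = np.isin(top_tokens, list(token_ids)).any(axis=1)
    return np.flatnonzero(mask)

def ffn_forward(x, W_in, W_out, steer_neurons=None, alpha=0.0):
    """Toy ReLU feedforward block. If steer_neurons is given, those hidden
    activations are overwritten with alpha (one possible intervention)."""
    h = np.maximum(x @ W_in, 0.0)             # shape (d_ff,)
    if steer_neurons is not None:
        h = h.copy()
        h[steer_neurons] = alpha
    return h @ W_out                          # shape (d_model,)

# Hypothetical toy setup: 4 tokens, orthonormal embeddings, 3 FFN neurons.
vocab = ["slow", "fast", "up", "down"]
E = np.eye(4)                                 # (vocab, d_model)
W_out = np.array([[0.0, 1.0, 0.0, 0.0],       # neuron 0 writes toward "fast"
                  [0.0, 0.0, 1.0, 0.0],       # neuron 1 writes toward "up"
                  [1.0, 0.0, 0.0, 0.0]])      # neuron 2 writes toward "slow"
W_in = np.full((4, 3), 0.25)                  # (d_model, d_ff)
x = np.ones(4)

top = top_tokens_per_neuron(W_out, E, k=1)
speed_neurons = neurons_for_tokens(top, [vocab.index("fast")])

base = ffn_forward(x, W_in, W_out)
steered = ffn_forward(x, W_in, W_out, steer_neurons=speed_neurons, alpha=5.0)

# Scores of each token under the (identity) embedding basis.
print("baseline:", dict(zip(vocab, base @ E.T)))
print("steered: ", dict(zip(vocab, steered @ E.T)))
```

In this toy case only the component aligned with "fast" changes between the two runs. In a real model the same intervention would more likely be applied through a framework-level hook on the chosen layer, and the effect on the decoded actions would have to be measured empirically.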
