Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Abstract

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, or propagate trajectory-level labels uniformly across every timestep, which obscures localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, π_0, and π_{0.5}. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.