Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

Abstract

Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods rely either on expensive action resampling or on external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose Hide-and-Seek, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, π_0, and π_{0.5}. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy-timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.
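To make the accuracy-timeliness trade-off under conformal prediction more concrete, the following is a minimal sketch of one common way to calibrate a runtime alarm threshold. It is not the procedure described in the paper: it assumes a plain split-conformal scheme with a single constant threshold, calibrated on the per-step failure scores of successful rollouts. The function names `conformal_threshold` and `first_alarm`, the score values, and the choice of `alpha` are all hypothetical.

```python
import math


def conformal_threshold(calib_scores, alpha):
    """Split-conformal threshold from successful calibration rollouts.

    calib_scores: list of rollouts, each a list of per-step failure scores
                  (assumed: higher score means more failure-like).
    alpha:        target false-alarm rate on successful rollouts.
    """
    # Nonconformity score of a rollout: its largest per-step score.
    maxima = sorted(max(traj) for traj in calib_scores)
    n = len(maxima)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration rollouts for this alpha: never raise an alarm.
        return math.inf
    return maxima[k - 1]


def first_alarm(scores, threshold):
    """Index of the first step whose score exceeds the threshold, else None."""
    for t, s in enumerate(scores):
        if s > threshold:
            return t
    return None


# Hypothetical calibration data: seven successful rollouts.
calib = [[0.05, 0.1], [0.2, 0.1], [0.3], [0.1, 0.4], [0.5, 0.2], [0.6], [0.3, 0.7]]
threshold = conformal_threshold(calib, alpha=0.25)
alarm_step = first_alarm([0.2, 0.5, 0.65, 0.9], threshold)
```

Under this assumed scheme, a smaller `alpha` raises the threshold, which lowers the false-alarm rate on successful rollouts but delays the step at which a real failure is flagged. That tension is one way to read the accuracy-timeliness trade-off.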