SAFE: Multitask Failure Detection for Vision-Language-Action Models

Abstract

Vision-language-action models (VLAs) show promising robotic behaviors across a diverse set of manipulation tasks, but they achieve limited success rates when deployed out of the box on novel tasks. For these policies to interact safely with their environments, we need a failure detector that raises a timely alert so that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested on only one or a few specific tasks, whereas VLAs require a detector that generalizes to unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, and that this knowledge is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures, and we test it extensively on OpenVLA, pi_0, and pi_0-FAST in both simulated and real-world environments. We compare SAFE with diverse baselines and show that it achieves state-of-the-art failure detection performance and, using conformal prediction, the best trade-off between accuracy and detection time. More qualitative results can be found at https://vla-safe.github.io/.
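
The scoring-and-thresholding idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear probe `failure_score`, the helper names, and the choice to calibrate on the maximum score of each successful rollout are all hypothetical. The abstract states only that SAFE predicts a scalar from VLA internal features and uses conformal prediction.

```python
import math


def failure_score(features, weights, bias):
    """Hypothetical linear probe: map one feature vector to a score in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal upper quantile of scores from successful rollouts.

    Assumption: each calibration score is the maximum failure score observed
    in one successful rollout. With n scores, the ceil((n + 1) * (1 - alpha))-th
    smallest is returned, so a new exchangeable successful rollout exceeds it
    with probability at most alpha.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration rollouts for this alpha
    return sorted(calibration_scores)[rank - 1]


def detect_failure(step_scores, threshold):
    """Return the first timestep whose score exceeds the threshold, else None."""
    for t, score in enumerate(step_scores):
        if score > threshold:
            return t
    return None


# Illustrative numbers only (not results from the paper).
calibration = [0.10, 0.20, 0.15, 0.30, 0.25, 0.12, 0.18, 0.22, 0.28, 0.35]
tau = conformal_threshold(calibration, alpha=0.25)
alert_step = detect_failure([0.10, 0.20, 0.31, 0.50], tau)
```

A smaller `alpha` gives a higher threshold, which means fewer false alarms on successful rollouts but later alerts on failing ones. This is the accuracy versus detection-time trade-off the abstract refers to.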
