RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Abstract

We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that for a fixed collection budget it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also demonstrate how evaluating real robot experiments with an intervention mechanism enables performing tasks to completion, making the system deployable with human oversight even if imperfect, while also providing a single performance metric. We demonstrate a single video-conditioned model named RoboVQA-VideoCoCa, trained on our dataset, that is capable of performing a variety of grounded high-level reasoning tasks in broad realistic settings with a cognitive intervention rate 46% lower than the zero-shot state-of-the-art visual language model (VLM) baseline, and that is able to guide real robots through long-horizon tasks. The performance gap with zero-shot state-of-the-art models indicates that a lot of grounded data remains to be collected for real-world deployment, emphasizing the critical need for scalable data collection approaches. Finally, we show that video VLMs significantly outperform single-image VLMs, with an average error rate reduction of 19% across all VQA tasks. Data and videos are available at https://robovqa.github.io.
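As a rough illustration of how a single intervention-based metric and a "percent lower than baseline" comparison could be computed, the sketch below pools human corrections over evaluated steps and then takes the relative reduction against a baseline. The abstract does not define the metric, so the `Episode` record, the function names, and all numbers here are hypothetical and chosen only for the arithmetic. They are not results from the paper, and the reading of "lower" as a relative reduction is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluated long-horizon episode (hypothetical record layout)."""
    steps: int          # high-level steps the model was asked to produce
    interventions: int  # steps where a human had to correct the model


def intervention_rate(episodes):
    """Fraction of steps needing a human correction, pooled over episodes."""
    total_steps = sum(e.steps for e in episodes)
    if total_steps == 0:
        raise ValueError("no steps to evaluate")
    return sum(e.interventions for e in episodes) / total_steps


def relative_reduction(baseline_rate, model_rate):
    """How much lower model_rate is, as a fraction of baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline_rate - model_rate) / baseline_rate


# Made-up counts for illustration only (assumption, not paper data).
baseline = [Episode(steps=10, interventions=8), Episode(steps=10, interventions=7)]
model = [Episode(steps=10, interventions=4), Episode(steps=10, interventions=5)]

base_rate = intervention_rate(baseline)
model_rate = intervention_rate(model)
print(f"baseline rate:      {base_rate:.0%}")
print(f"model rate:         {model_rate:.0%}")
print(f"relative reduction: {relative_reduction(base_rate, model_rate):.0%}")
```

Pooling over steps rather than averaging per-episode rates weights long episodes more heavily; either choice is defensible, and the sketch picks pooling only for simplicity.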
