ViNT: A Foundation Model for Visual Navigation

Abstract

General-purpose pre-trained models ("foundation models") have enabled practitioners to produce generalizable solutions for individual machine learning problems with datasets that are significantly smaller than those required for learning from scratch. Such models are typically trained on large and diverse datasets with weak supervision, consuming much more training data than is available for any individual downstream application. In this paper, we describe the Visual Navigation Transformer (ViNT), a foundation model that aims to bring the success of general-purpose pre-trained models to vision-based robotic navigation. ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset, and employs a flexible Transformer-based architecture to learn navigational affordances and enable efficient adaptation to a variety of downstream navigational tasks. ViNT is trained on a number of existing navigation datasets, comprising hundreds of hours of robotic navigation from a variety of different robotic platforms, and exhibits positive transfer, outperforming specialist models trained on individual datasets. ViNT can be augmented with diffusion-based subgoal proposals to explore novel environments, and can solve kilometer-scale navigation problems when equipped with long-range heuristics. ViNT can also be adapted to novel task specifications with a technique inspired by prompt-tuning, where the goal encoder is replaced by an encoding of another task modality (e.g., GPS waypoints or routing commands) embedded into the same space of goal tokens. This flexibility and ability to accommodate a variety of downstream problem domains establish ViNT as an effective foundation model for mobile robotics. For videos, code, and model checkpoints, see our project page at https://visualnav-transformer.github.io.
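
The prompt-tuning-inspired adaptation described above can be pictured with a small toy sketch. It is not the authors' implementation: every name here (TOKEN_DIM, make_linear, image_goal_encoder, gps_goal_encoder, backbone, act) is hypothetical, the "encoders" are fixed random linear maps standing in for learned networks, and the token width is an arbitrary choice. The only point it illustrates is that a new task modality can be plugged in as long as its encoder emits a token of the same shape as the goal token the backbone already consumes. Whether and how much of the backbone is fine-tuned during adaptation is not stated in the abstract, so the sketch simply leaves it untouched.

```python
import random
from typing import Callable, List, Sequence

TOKEN_DIM = 8  # hypothetical goal-token width; the real model's size is not given here

Token = List[float]
Encoder = Callable[[Sequence[float]], Token]


def make_linear(in_dim: int, out_dim: int, seed: int) -> Encoder:
    """Return a fixed random linear map, standing in for a learned network."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(x: Sequence[float]) -> Token:
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} inputs, got {len(x)}")
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return apply


# Stand-in for the pre-trained image goal encoder (input: a flattened 4x4 toy image).
image_goal_encoder = make_linear(in_dim=16, out_dim=TOKEN_DIM, seed=0)

# New encoder for a different modality: a 2-D GPS waypoint offset.
gps_goal_encoder = make_linear(in_dim=2, out_dim=TOKEN_DIM, seed=1)

# Stand-in for the pre-trained backbone's action head (assumed, not from the paper).
_action_head = make_linear(in_dim=2 * TOKEN_DIM, out_dim=2, seed=2)


def backbone(obs_token: Token, goal_token: Token) -> Token:
    """Toy policy: it only ever sees tokens, never the raw goal modality."""
    if len(obs_token) != TOKEN_DIM or len(goal_token) != TOKEN_DIM:
        raise ValueError("observation and goal tokens must share TOKEN_DIM")
    return _action_head(list(obs_token) + list(goal_token))


def act(obs_token: Token, goal: Sequence[float], goal_encoder: Encoder) -> Token:
    """Encode the goal with whichever encoder is supplied, then query the backbone."""
    return backbone(obs_token, goal_encoder(goal))


# Same backbone, two goal modalities.
obs_token = [0.1] * TOKEN_DIM
image_action = act(obs_token, [0.5] * 16, image_goal_encoder)
gps_action = act(obs_token, [12.0, -3.5], gps_goal_encoder)
```

Because backbone validates only the token shape, swapping image_goal_encoder for gps_goal_encoder requires no change to the policy itself, which is the property the abstract attributes to embedding other modalities into the same space of goal tokens.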
