See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

Abstract

We present See, Point, Fly (SPF), a training-free aerial vision-and-language navigation (AVLN) framework built atop vision-language models (VLMs). SPF can navigate to any goal, given any type of free-form instruction, in any kind of environment. In contrast to existing VLM-based approaches that treat action prediction as a text generation task, our key insight is to treat action prediction for AVLN as a 2D spatial grounding task. SPF harnesses VLMs to decompose vague language instructions into iterative annotations of 2D waypoints on the input image. Using the predicted traveling distance, SPF transforms the predicted 2D waypoints into 3D displacement vectors that serve as action commands for UAVs. SPF also adaptively adjusts the traveling distance for more efficient navigation. Notably, SPF navigates in a closed-loop control manner, enabling UAVs to follow dynamic targets in dynamic environments. SPF sets a new state of the art on the DRL simulation benchmark, outperforming the previous best method by an absolute margin of 63%. In extensive real-world evaluations, SPF outperforms strong baselines by a large margin. We also conduct comprehensive ablation studies to highlight the effectiveness of our design choices. Lastly, SPF shows remarkable generalization to different VLMs. Project page: https://spf-web.pages.dev
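
The abstract does not say how a 2D waypoint and a predicted traveling distance become a 3D displacement. One plausible reading is a pinhole-camera back-projection, sketched below. This is an illustrative assumption and not the authors' implementation: the function name `waypoint_to_displacement`, the 90-degree horizontal field of view, the square-pixel pinhole model, and the camera-frame axis convention (x right, y down, z forward) are all hypothetical choices made for the example.

```python
import math


def waypoint_to_displacement(u, v, distance_m, width, height, hfov_deg=90.0):
    """Back-project pixel (u, v) into a 3D displacement of length distance_m.

    Assumes an ideal pinhole camera with square pixels and the principal
    point at the image center. Returns (right, down, forward) in metres,
    expressed in the camera frame. None of these conventions come from
    the paper; they are assumptions for illustration.
    """
    # Focal length in pixels, derived from the assumed horizontal field of view.
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Direction of the ray through the pixel (unnormalized), camera frame.
    x = (u - width / 2.0) / focal_px
    y = (v - height / 2.0) / focal_px
    z = 1.0

    # Scale the unit ray so the displacement has the requested length.
    scale = distance_m / math.sqrt(x * x + y * y + z * z)
    return (x * scale, y * scale, z * scale)
```

Under these assumptions, a waypoint at the image center yields a purely forward displacement, while a waypoint toward an image edge adds a lateral or vertical component. In a closed-loop setting, such a function would be called again on each new frame with a freshly predicted waypoint and distance.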
