ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has given Unmanned Aerial Vehicles (UAVs) strong capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them well suited to Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional computer vision and path-planning methods and lacks a comprehensive, unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations in order to make informed decisions. We then present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Using Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, the benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the difficulty of ESAR, revealing critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench serves as a valuable resource for advancing research on Embodied Search and Rescue. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.
