ESARBench: A Benchmark for Agentic UAV Search and Rescue

Abstract

The rapid advancement of Multimodal Large Language Models (MLLMs) has given Unmanned Aerial Vehicles (UAVs) strong capabilities in spatial reasoning, semantic understanding, and complex decision-making, which makes them well suited to Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive, unified benchmark for embodied agents. To bridge this gap, we propose the novel task of Embodied Search and Rescue (ESAR), which requires aerial agents to autonomously explore complex environments, identify rescue clues, reason about victim locations, and make informed decisions. We also present ESARBench, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Using Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, the benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the difficulty of ESAR and reveal critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench will serve as a valuable resource for advancing research on Embodied Search and Rescue. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.
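
To make the dynamic variables named in the abstract concrete (weather conditions, time of day, and stochastic clue placement), the following is a minimal sketch of how one task configuration might be sampled reproducibly. It is not ESARBench's actual schema: the class `TaskConfig`, the function `sample_task`, the value sets `WEATHER` and `TIME_OF_DAY`, and the square-map assumption are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical value sets; the abstract does not enumerate the actual options.
WEATHER = ("clear", "rain", "fog", "snow")
TIME_OF_DAY = ("day", "dusk", "night")


@dataclass(frozen=True)
class TaskConfig:
    """One illustrative task: an environment plus its sampled conditions."""
    environment: str
    weather: str
    time_of_day: str
    clue_positions: tuple  # (x, y) offsets in metres from the map origin


def sample_task(environment, num_clues, map_size_m, seed):
    """Draw one task configuration; the same seed always yields the same task."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    # Assumption: clues are placed uniformly at random on a square map.
    clues = tuple(
        (rng.uniform(0.0, map_size_m), rng.uniform(0.0, map_size_m))
        for _ in range(num_clues)
    )
    return TaskConfig(
        environment=environment,
        weather=rng.choice(WEATHER),
        time_of_day=rng.choice(TIME_OF_DAY),
        clue_positions=clues,
    )
```

Seeding a local `random.Random` instance per task keeps episodes repeatable across runs, which matters when comparing baselines on identical conditions; a real benchmark would likely constrain clue placement by terrain rather than sample uniformly.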