GAEA: A Geolocation Aware Conversational Model
March 20, 2025
Authors: Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, Mubarak Shah
cs.AI
Abstract
Image geolocalization, in which an AI model traditionally predicts the
precise GPS coordinates of an image, is a challenging task with many downstream
applications. However, the user cannot use the model to further their
knowledge beyond the GPS coordinates; the model lacks an understanding of
the location and the conversational ability to communicate with the user.
Recently, with the tremendous progress of large multimodal models (LMMs),
researchers in both the proprietary and open-source communities have attempted
to geolocalize images via LMMs. However, the issue remains unaddressed: beyond
general tasks, LMMs struggle with more specialized downstream tasks, one of
which is geolocalization. In this work, we propose to solve this problem by
introducing a conversational model, GAEA, that can provide information about
the location of an image, as required by a user. No large-scale dataset
enabling the training of such a model exists. Thus, we propose a comprehensive
dataset, GAEA, with 800K images and around 1.6M question-answer pairs,
constructed by leveraging OpenStreetMap (OSM) attributes and geographical
context clues. For quantitative evaluation, we propose a diverse benchmark
comprising 4K image-text pairs with diverse question types to evaluate
conversational capabilities. We consider 11 state-of-the-art open-source and
proprietary LMMs and demonstrate that GAEA significantly outperforms the best
open-source model, LLaVA-OneVision, by 25.69% and the best proprietary model,
GPT-4o, by 8.28%. Our dataset, model, and code are publicly available.