GAEA: A Geolocation Aware Conversational Model
March 20, 2025
Authors: Ron Campos, Ashmal Vayani, Parth Parag Kulkarni, Rohit Gupta, Aritra Dutta, Mubarak Shah
cs.AI
Abstract
Image geolocalization, in which an AI model traditionally predicts the
precise GPS coordinates of an image, is a challenging task with many
downstream applications. However, the user cannot utilize such a model to
further their knowledge beyond the GPS coordinates: the model lacks an
understanding of the location and the conversational ability to communicate
with the user. Recently, with the tremendous progress of large multimodal
models (LMMs), both proprietary and open-source, researchers have attempted
to geolocalize images via LMMs. However, the issue remains unaddressed:
beyond general tasks, LMMs struggle with more specialized downstream tasks,
one of which is geolocalization. In this work, we propose to solve this
problem by introducing a conversational model, GAEA, that can provide
information regarding the location of an image, as required by a user. No
large-scale dataset enabling the training of such a model exists, so we
propose a comprehensive dataset, GAEA, with 800K images and around 1.6M
question-answer pairs, constructed by leveraging OpenStreetMap (OSM)
attributes and geographical context clues. For quantitative evaluation, we
propose a diverse benchmark comprising 4K image-text pairs with diverse
question types to evaluate conversational capabilities. We consider 11
state-of-the-art open-source and proprietary LMMs and demonstrate that GAEA
significantly outperforms the best open-source model, LLaVA-OneVision, by
25.69% and the best proprietary model, GPT-4o, by 8.28%. Our dataset, model,
and code are publicly available.
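The abstract describes building question-answer pairs from OpenStreetMap attributes attached to geotagged images. A minimal sketch of that idea, not the authors' actual pipeline, might look like the following; the tag names (`addr:city`, `tourism`), question templates, and the hardcoded tag dictionary standing in for a real OSM lookup are all illustrative assumptions.

```python
# Hedged sketch: turning OSM-style key/value tags for a geotagged image
# into template question-answer pairs, in the spirit of the dataset
# construction described in the abstract. Tags and templates are assumed.

def osm_tags_to_qa(lat, lon, tags):
    """Build simple QA pairs from a coordinate and OSM-style tags."""
    qa_pairs = [
        ("What are the GPS coordinates of this image?",
         f"Approximately ({lat:.4f}, {lon:.4f}).")
    ]
    if "addr:city" in tags:
        qa_pairs.append(("Which city was this image taken in?",
                         tags["addr:city"]))
    if "tourism" in tags:
        qa_pairs.append(("What kind of attraction is shown?",
                         tags["tourism"]))
    return qa_pairs

# Hardcoded stand-in for a real OSM query result (a production pipeline
# would fetch attributes for the coordinate, e.g. via the Overpass API).
sample_tags = {"addr:city": "Paris", "tourism": "attraction"}
pairs = osm_tags_to_qa(48.8584, 2.2945, sample_tags)
for question, answer in pairs:
    print(question, "->", answer)
```

Scaled over 800K images, template generation like this is one plausible way such a corpus reaches roughly two QA pairs per image, as the reported 1.6M pairs suggest.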