"Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries

Abstract

Interactive digital maps have revolutionized how people travel and learn about the world. However, they rely on pre-existing structured data in GIS databases (e.g., road networks, POI indices), which limits their ability to address geo-visual questions about what the world looks like. We introduce our vision for Geo-Visual Agents: multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos), combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work.
