Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

Abstract

We investigate whether socio-economic indicators such as household wealth leave recoverable imprints in satellite imagery (which captures physical features) and Internet-sourced text (which reflects historical and economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with two kinds of text: LLM-generated descriptions conditioned on location and year, and text retrieved from web sources by an AI search agent. We develop a multimodal framework that predicts household wealth (International Wealth Index) through five pipelines: (i) a vision model on satellite images, (ii) an LLM using only location and year, (iii) an AI agent that searches and synthesizes web text, (iv) a joint image-text encoder, and (v) an ensemble of all signals. The framework yields three contributions. First, fusing vision with agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from the vision and language modalities correlate moderately (median cosine similarity of 0.60 after alignment), which suggests a shared latent code of material well-being that still retains complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, which challenges our Agent-Induced Novelty Hypothesis, modest gains from adding agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
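The two headline quantities in the abstract, the out-of-sample R-squared and the median cosine similarity between paired vision and language embeddings, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, the inputs are toy values, and the embeddings are assumed to have already been aligned into a common space by whatever alignment procedure is used (the abstract does not specify it).

```python
import math
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def median_cosine_similarity(vision_embs, text_embs):
    """Median of per-cluster cosine similarities between paired embeddings.

    Assumes row i of each list describes the same survey cluster and that
    both embedding sets already live in a shared (aligned) space.
    """
    sims = [cosine_similarity(u, v) for u, v in zip(vision_embs, text_embs)]
    return statistics.median(sims)


if __name__ == "__main__":
    # Toy wealth-index values and predictions (illustrative only).
    wealth_true = [10.0, 20.0, 30.0, 40.0]
    wealth_pred = [12.0, 18.0, 33.0, 37.0]
    print("R-squared:", r_squared(wealth_true, wealth_pred))

    # Toy 2-D embeddings for three clusters (illustrative only).
    vision = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    print("Median cosine similarity:", median_cosine_similarity(vision, text))
```

Reporting the median rather than the mean keeps the summary robust to a few clusters whose two embeddings disagree sharply, which may be why a median is quoted in the abstract.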
