

Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?

August 1, 2025
作者: Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud
cs.AI

Abstract

We investigate whether socio-economic indicators like household wealth leave recoverable imprints in satellite imagery (capturing physical features) and Internet-sourced text (reflecting historical/economic narratives). Using Demographic and Health Survey (DHS) data from African neighborhoods, we pair Landsat images with LLM-generated textual descriptions conditioned on location/year and text retrieved by an AI search agent from web sources. We develop a multimodal framework predicting household wealth (International Wealth Index) through five pipelines: (i) vision model on satellite images, (ii) LLM using only location/year, (iii) AI agent searching/synthesizing web text, (iv) joint image-text encoder, (v) ensemble of all signals. Our framework yields three contributions. First, fusing vision and agent/LLM text outperforms vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on out-of-sample splits), with LLM-internal knowledge proving more effective than agent-retrieved text, improving robustness to out-of-country and out-of-time generalization. Second, we find partial representational convergence: fused embeddings from vision/language modalities correlate moderately (median cosine similarity of 0.60 after alignment), suggesting a shared latent code of material well-being while retaining complementary details, consistent with the Platonic Representation Hypothesis. Although LLM-only text outperforms agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest gains from combining agent data in some splits weakly support the notion that agent-gathered information introduces unique representational structures not fully captured by static LLM knowledge. Third, we release a large-scale multimodal dataset comprising more than 60,000 DHS clusters linked to satellite images, LLM-generated descriptions, and agent-retrieved texts.
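The reported convergence statistic (median cosine similarity of 0.60 after alignment) can be illustrated with a minimal sketch. This is not the paper's procedure; it assumes a simple least-squares linear map from text embeddings to vision embeddings, followed by row-wise cosine similarity, and the embedding dimensions are invented for the toy example.

```python
import numpy as np

def median_cosine_after_alignment(vision_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Align text embeddings to vision embeddings with an ordinary
    least-squares linear map, then return the median per-sample cosine
    similarity. Illustrative only; the paper's alignment may differ."""
    # Fit W minimizing ||text_emb @ W - vision_emb||_F.
    W, *_ = np.linalg.lstsq(text_emb, vision_emb, rcond=None)
    aligned = text_emb @ W
    # Row-wise cosine similarity between aligned text and vision vectors.
    num = np.sum(aligned * vision_emb, axis=1)
    denom = np.linalg.norm(aligned, axis=1) * np.linalg.norm(vision_emb, axis=1)
    cos = num / np.clip(denom, 1e-12, None)
    return float(np.median(cos))

# Toy data: text embeddings are a noisy linear transform of vision embeddings,
# so the two modalities share a latent code up to noise.
rng = np.random.default_rng(0)
vision = rng.normal(size=(200, 32))
text = vision @ rng.normal(size=(32, 32)) + 0.1 * rng.normal(size=(200, 32))
score = median_cosine_after_alignment(vision, text)
```

A value near 1.0 on this toy data reflects an almost-exact linear relationship; a partial-convergence result such as the paper's 0.60 would indicate a shared latent code alongside substantial modality-specific structure.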