Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
August 1, 2025
Authors: Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud
cs.AI
Abstract
We investigate whether socio-economic indicators like household wealth leave
recoverable imprints in satellite imagery (capturing physical features) and
Internet-sourced text (reflecting historical/economic narratives). Using
Demographic and Health Survey (DHS) data from African neighborhoods, we pair
Landsat images with LLM-generated textual descriptions conditioned on
location/year and text retrieved by an AI search agent from web sources. We
develop a multimodal framework predicting household wealth (International
Wealth Index) through five pipelines: (i) a vision model on satellite images,
(ii) an LLM using only location/year, (iii) an AI agent searching/synthesizing web
text, (iv) a joint image-text encoder, and (v) an ensemble of all signals. Our framework
yields three contributions. First, fusing vision and agent/LLM text outperforms
vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on
out-of-sample splits), with LLM-internal knowledge proving more effective than
agent-retrieved text and improving out-of-country and out-of-time
generalization. Second, we find partial representational convergence: fused
embeddings from vision/language modalities correlate moderately (median cosine
similarity of 0.60 after alignment), suggesting a shared latent code of
material well-being while retaining complementary details, consistent with the
Platonic Representation Hypothesis. Although LLM-only text outperforms
agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest
gains from combining agent data in some splits weakly support the notion that
agent-gathered information introduces unique representational structures not
fully captured by static LLM knowledge. Third, we release a large-scale
multimodal dataset comprising more than 60,000 DHS clusters linked to satellite
images, LLM-generated descriptions, and agent-retrieved texts.
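The convergence measurement described above (median cosine similarity between vision and language embeddings "after alignment") can be illustrated with a minimal sketch. The snippet below is an assumption about the procedure, not the authors' code: it aligns one embedding space to the other with orthogonal Procrustes and reports per-cluster cosine similarity on synthetic data standing in for the DHS clusters.

```python
import numpy as np

def align_and_compare(vision_emb, text_emb):
    """Rotate text embeddings into the vision space via orthogonal
    Procrustes, then return the per-row cosine similarity.
    Shapes and names here are illustrative assumptions."""
    # W = argmin_{W orthogonal} ||text_emb @ W - vision_emb||_F
    u, _, vt = np.linalg.svd(text_emb.T @ vision_emb)
    aligned = text_emb @ (u @ vt)
    # Cosine similarity per row (one row per cluster)
    num = np.sum(aligned * vision_emb, axis=1)
    denom = (np.linalg.norm(aligned, axis=1)
             * np.linalg.norm(vision_emb, axis=1))
    return num / denom

# Synthetic stand-in: text embeddings are a rotated, noisy copy of
# the vision embeddings, mimicking partial representational convergence.
rng = np.random.default_rng(0)
vision = rng.normal(size=(100, 32))
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
text = vision @ rotation + 0.5 * rng.normal(size=(100, 32))

sims = align_and_compare(vision, text)
print(f"median cosine similarity after alignment: {np.median(sims):.2f}")
```

With moderate noise the median similarity lands well above chance but below 1.0, the qualitative pattern the abstract reports (shared latent structure plus complementary, modality-specific detail).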