Platonic Representations for Poverty Mapping: Unified Vision-Language Codes or Agent-Induced Novelty?
August 1, 2025
Authors: Satiyabooshan Murugaboopathy, Connor T. Jerzak, Adel Daoud
cs.AI
Abstract
We investigate whether socio-economic indicators like household wealth leave
recoverable imprints in satellite imagery (capturing physical features) and
Internet-sourced text (reflecting historical/economic narratives). Using
Demographic and Health Survey (DHS) data from African neighborhoods, we pair
Landsat images with LLM-generated textual descriptions conditioned on
location/year and text retrieved by an AI search agent from web sources. We
develop a multimodal framework predicting household wealth (International
Wealth Index) through five pipelines: (i) a vision model on satellite images,
(ii) an LLM using only location/year, (iii) an AI agent searching/synthesizing web
text, (iv) a joint image-text encoder, and (v) an ensemble of all signals. Our framework
yields three contributions. First, fusing vision and agent/LLM text outperforms
vision-only baselines in wealth prediction (e.g., R-squared of 0.77 vs. 0.63 on
out-of-sample splits), with LLM-internal knowledge proving more effective than
agent-retrieved text and improving out-of-country and out-of-time
generalization. Second, we find partial representational convergence: fused
embeddings from vision/language modalities correlate moderately (median cosine
similarity of 0.60 after alignment), suggesting a shared latent code of
material well-being while retaining complementary details, consistent with the
Platonic Representation Hypothesis. Although LLM-only text outperforms
agent-retrieved data, challenging our Agent-Induced Novelty Hypothesis, modest
gains from combining agent data in some splits weakly support the notion that
agent-gathered information introduces unique representational structures not
fully captured by static LLM knowledge. Third, we release a large-scale
multimodal dataset comprising more than 60,000 DHS clusters linked to satellite
images, LLM-generated descriptions, and agent-retrieved texts.
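The convergence measurement described above (median cosine similarity between vision and language embeddings "after alignment") can be illustrated with a minimal sketch. The snippet below is an assumption about the procedure, not the authors' code: it aligns one embedding space to the other with orthogonal Procrustes and reports per-cluster cosine similarity on synthetic data standing in for the DHS clusters.

```python
import numpy as np

def align_and_compare(vision_emb, text_emb):
    """Rotate text embeddings into the vision space via orthogonal
    Procrustes, then return the per-row cosine similarity.
    Shapes and names here are illustrative assumptions."""
    # W = argmin_{W orthogonal} ||text_emb @ W - vision_emb||_F
    u, _, vt = np.linalg.svd(text_emb.T @ vision_emb)
    aligned = text_emb @ (u @ vt)
    # Cosine similarity per row (one row per cluster)
    num = np.sum(aligned * vision_emb, axis=1)
    denom = (np.linalg.norm(aligned, axis=1)
             * np.linalg.norm(vision_emb, axis=1))
    return num / denom

# Synthetic stand-in: text embeddings are a rotated, noisy copy of
# the vision embeddings, mimicking partial representational convergence.
rng = np.random.default_rng(0)
vision = rng.normal(size=(100, 32))
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
text = vision @ rotation + 0.5 * rng.normal(size=(100, 32))

sims = align_and_compare(vision, text)
print(f"median cosine similarity after alignment: {np.median(sims):.2f}")
```

With moderate noise the median similarity lands well above chance but below 1.0, the qualitative pattern the abstract reports (shared latent structure plus complementary, modality-specific detail).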