Geospatial Mechanistic Interpretability of Large Language Models
May 6, 2025
Authors: Stef De Sabbata, Stefano Mizzaro, Kevin Roitero
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated unprecedented capabilities
across various natural language processing tasks. Their ability to process and
generate viable text and code has made them ubiquitous in many fields, while
their deployment as knowledge bases and "reasoning" tools remains an area of
ongoing research. In geography, a growing body of literature has been focusing
on evaluating LLMs' geographical knowledge and their ability to perform spatial
reasoning. However, very little is known about the internal functioning
of these models, especially about how they process geographical information.
In this chapter, we establish a novel framework for the study of geospatial
mechanistic interpretability: using spatial analysis to reverse engineer how
LLMs handle geographical information. Our aim is to advance our understanding
of the internal representations that these complex models generate while
processing geographical information, or what one might call "how LLMs think
about geographic information" if such phrasing were not an undue
anthropomorphism.
We first outline the use of probing in revealing internal structures within
LLMs. We then introduce the field of mechanistic interpretability, discussing
the superposition hypothesis and the role of sparse autoencoders in
disentangling polysemantic internal representations of LLMs into more
interpretable, monosemantic features. In our experiments, we use spatial
autocorrelation to show how features obtained for placenames display spatial
patterns related to their geographic location and can thus be interpreted
geospatially, providing insights into how these models process geographical
information. We conclude by discussing how our framework can help shape the
study and use of foundation models in geography.
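
The abstract's use of spatial autocorrelation on placename features can be illustrated with a minimal sketch of global Moran's I. This is not the chapter's code: the inverse-distance weighting, the grid of hypothetical placename coordinates, and the synthetic "feature activation" values are all assumptions chosen only to show the statistic.

```python
import numpy as np


def morans_i(values, coords):
    """Global Moran's I with inverse-distance weights (an assumed weighting scheme)."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = values.size

    # Pairwise Euclidean distances between locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Inverse-distance weights; the diagonal (and any coincident points) stays at zero.
    weights = np.zeros_like(dist)
    off_diag = dist > 0
    weights[off_diag] = 1.0 / dist[off_diag]

    # Deviations from the mean, then the weighted cross-product ratio.
    z = values - values.mean()
    numerator = (weights * np.outer(z, z)).sum()
    denominator = (z ** 2).sum()
    return (n / weights.sum()) * (numerator / denominator)


# Hypothetical placename locations on a 10 x 10 grid (synthetic, not real data).
xs, ys = np.meshgrid(np.arange(10), np.arange(10))
coords = np.column_stack([xs.ravel(), ys.ravel()])

# A synthetic feature that varies smoothly west to east, and the same values shuffled.
gradient = coords[:, 0].astype(float)
rng = np.random.default_rng(0)
shuffled = rng.permutation(gradient)

i_gradient = morans_i(gradient, coords)
i_shuffled = morans_i(shuffled, coords)
print(f"Moran's I, smooth feature:   {i_gradient:.3f}")
print(f"Moran's I, shuffled feature: {i_shuffled:.3f}")
```

Under these assumptions, a feature whose activation varies smoothly over space yields a clearly positive Moran's I, while the same values assigned to random locations yield a value near zero. A positive value of this kind is what would suggest that a feature carries geographically structured information.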