
Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

April 1, 2026
Authors: Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva
cs.AI

Abstract

Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
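The localization step described above — scoring MLP neurons by how selectively they fire for one entity across templated prompts — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the array shapes, the scoring formula (mean difference over pooled standard deviation), and all names are illustrative assumptions; in practice the activations would be collected from a real model at the entity token position.

```python
import numpy as np

def entity_selectivity(acts: np.ndarray) -> np.ndarray:
    """Score each neuron's selectivity for entity 0.

    acts: shape (n_entities, n_prompts, n_neurons), holding MLP
    activations at the entity token for templated prompts.
    Returns one score per neuron: mean activation on the target
    entity minus mean activation on all other entities, scaled by
    the pooled standard deviation across prompts (a z-like score).
    """
    target = acts[0]                              # (n_prompts, n_neurons)
    others = acts[1:].reshape(-1, acts.shape[-1])  # pool remaining entities
    diff = target.mean(axis=0) - others.mean(axis=0)
    pooled = np.sqrt((target.var(axis=0) + others.var(axis=0)) / 2) + 1e-8
    return diff / pooled

# Synthetic check: make neuron 3 fire selectively for entity 0,
# then confirm it receives the top selectivity score.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8, 16))  # 5 entities, 8 prompts, 16 neurons
acts[0, :, 3] += 4.0                # entity-selective neuron
scores = entity_selectivity(acts)
print(int(scores.argmax()))
```

A candidate neuron found this way would then be validated causally, e.g. by zeroing it (negative ablation) or boosting it at a placeholder token (injection) during a forward pass and checking the effect on answer retrieval.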