Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
April 1, 2026
Authors: Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva
cs.AI
Abstract
Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
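To make the interventions described above concrete, the following is a minimal sketch, not the authors' code, of single-neuron ablation and injection using a PyTorch forward hook on a Hugging Face causal LM. The model name ("gpt2"), the layer and neuron indices, the injected activation value, and the choice to hook the MLP up-projection output at every token position are illustrative assumptions; the paper targets a placeholder token and does not specify this exact hook point.

```python
# Sketch: zero (ablate) or overwrite (inject) one MLP neuron via a forward hook.
# All coordinates and values below are placeholders, not results from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # placeholder; the paper studies multiple LMs
LAYER, NEURON = 3, 1234    # hypothetical "entity cell" coordinates
INJECT_VALUE = 8.0         # hypothetical activation level for injection

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


def make_hook(neuron, value=None):
    """Return a hook that zeroes (ablation) or overwrites (injection) one
    coordinate of the MLP hidden activation. For simplicity this edits every
    position; the paper's injection targets a placeholder token."""
    def hook(module, inputs, output):
        output = output.clone()
        output[..., neuron] = 0.0 if value is None else value
        return output
    return hook


# For GPT-2 the MLP up-projection is model.transformer.h[LAYER].mlp.c_fc;
# other architectures expose different module paths.
mlp_in = model.transformer.h[LAYER].mlp.c_fc

prompt = "Question: In which country was X born? Answer:"
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    # Injection run; pass make_hook(NEURON) with no value for ablation instead.
    handle = mlp_in.register_forward_hook(make_hook(NEURON, value=INJECT_VALUE))
    out = model.generate(**ids, max_new_tokens=5, pad_token_id=tok.eos_token_id)
    handle.remove()

print(tok.decode(out[0]))
```

Comparing generations with the hook installed against an unhooked baseline, and against a wrong-neuron control, mirrors the ablation/injection contrasts the abstract describes, though the actual evaluation protocol (PopQA prompts, mean-entity controls) is only summarized above.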