Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Abstract

Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
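
The localize-then-intervene idea above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the abstract does not state the selectivity criterion, so a simple mean-difference score stands in for it; the entity names and activation values are invented; and "negative ablation" is read here as clamping the localized neuron to a negative value. A real experiment would read and overwrite MLP activations inside a model rather than plain Python lists.

```python
from statistics import mean


def selectivity_scores(acts, entity):
    """Score each neuron by how much more it fires for `entity` than for others.

    `acts` maps an entity name to a list of activation vectors, one per
    templated prompt. The mean-difference rule is an assumed stand-in,
    not the criterion used in the paper.
    """
    n_neurons = len(acts[entity][0])
    other_vectors = [v for name, vs in acts.items() if name != entity for v in vs]
    scores = []
    for j in range(n_neurons):
        on_target = mean(v[j] for v in acts[entity])
        on_others = mean(v[j] for v in other_vectors)
        scores.append(on_target - on_others)
    return scores


def localize(acts, entity):
    """Return the index of the single most entity-selective neuron."""
    scores = selectivity_scores(acts, entity)
    return max(range(len(scores)), key=scores.__getitem__)


def set_neuron(vector, neuron, value):
    """Return a copy of `vector` with one neuron clamped to `value`.

    A negative `value` mimics ablation; a positive one mimics injection
    at a placeholder token (both readings are assumptions).
    """
    patched = list(vector)
    patched[neuron] = value
    return patched


# Hypothetical activations: 3 entities, 2 templated prompts each, 3 neurons.
toy_acts = {
    "Paris": [[0.1, 2.0, 0.3], [0.0, 1.8, 0.2]],
    "Tokyo": [[0.2, 0.1, 1.9], [0.1, 0.0, 2.1]],
    "Cairo": [[1.7, 0.2, 0.1], [1.9, 0.1, 0.3]],
}

paris_cell = localize(toy_acts, "Paris")
ablated = set_neuron(toy_acts["Paris"][0], paris_cell, -1.0)
injected = set_neuron([0.0, 0.0, 0.0], paris_cell, 2.0)
```

In this toy setting each entity has one clearly dominant neuron, so `localize` picks it out directly. The abstract notes that real models are less tidy: not every entity admits a reliable single-neuron handle.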
