Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Abstract

Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
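The localization step described above can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes MLP activations have already been collected for templated prompts about each entity, and it uses a simple standardized mean difference as the selectivity score. The function name `selectivity_scores`, the array shapes, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def selectivity_scores(acts, entity_ids):
    """Score how selective each neuron is for each entity.

    acts:       (n_prompts, n_neurons) activations, one row per templated prompt.
    entity_ids: (n_prompts,) integer label of the entity each prompt is about.
    Returns the sorted entity labels and an (n_entities, n_neurons) score matrix:
    mean activation on the entity's own prompts minus the mean on all other
    prompts, divided by the standard deviation over the other prompts.
    NOTE: this scoring rule is an assumption, not the paper's stated method.
    """
    entities = np.unique(entity_ids)
    scores = np.empty((len(entities), acts.shape[1]))
    for i, e in enumerate(entities):
        own = acts[entity_ids == e]
        rest = acts[entity_ids != e]
        scores[i] = (own.mean(axis=0) - rest.mean(axis=0)) / (rest.std(axis=0) + 1e-6)
    return entities, scores

# Synthetic stand-in for real activations: 5 entities, 8 templates each, 64 neurons.
rng = np.random.default_rng(0)
n_entities, n_templates, n_neurons = 5, 8, 64
entity_ids = np.repeat(np.arange(n_entities), n_templates)
acts = rng.normal(size=(n_entities * n_templates, n_neurons))

# Plant one strongly entity-selective neuron per entity (hypothetical indices).
planted = np.array([3, 17, 21, 40, 63])
for e, j in enumerate(planted):
    acts[entity_ids == e, j] += 10.0

entities, scores = selectivity_scores(acts, entity_ids)
recovered = scores.argmax(axis=1)  # most selective neuron for each entity
print(dict(zip(entities.tolist(), recovered.tolist())))
```

In a real setting the rows of `acts` would come from a model's MLP layers rather than a random generator, and any candidate neuron found this way would still need the causal checks the abstract describes (ablation and injection against controls) before being treated as an entity cell.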
