Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Abstract

Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
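
To make the consistency metric concrete, the following is a minimal sketch of a pairwise dictionary mean correlation coefficient between two learned dictionaries. It is an illustration under stated assumptions rather than the authors' implementation: it assumes each dictionary is an array with one feature direction per row, that features are compared by signed cosine similarity, and that features are paired one-to-one by Hungarian matching. The function name `pairwise_mcc` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_mcc(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean cosine similarity under an optimal one-to-one feature matching.

    dict_a, dict_b: arrays of shape (n_features, d_model), one feature
    direction per row (for example, SAE decoder weights from two runs).
    Assumption: signed cosine similarity is the per-pair score.
    """
    # Normalize each feature direction to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # Cosine similarity between every feature in A and every feature in B.
    sim = a @ b.T
    # Hungarian matching: pick the pairing that maximizes total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    # Average similarity over the matched pairs.
    return float(sim[rows, cols].mean())


# Toy data (hypothetical): 16 random feature directions in 64 dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(16, 64))
permuted = base[rng.permutation(16)]    # same features, different order
unrelated = rng.normal(size=(16, 64))   # independent random features

score_same = pairwise_mcc(base, permuted)
score_diff = pairwise_mcc(base, unrelated)
```

Because the matching is invariant to feature order, a reordered copy of a dictionary scores close to 1, while independently drawn dictionaries score much lower. In practice the score would be computed for each pair of independently trained SAEs and then averaged over pairs.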

