Mixture of Experts Made Intrinsically Interpretable

Abstract

Neurons in large language models often exhibit polysemanticity: a single neuron encodes multiple unrelated concepts at once, which obscures interpretability. Instead of relying on post-hoc methods, we present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, which aligns naturally with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This formulation enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed to and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves better perplexity than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
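
To make the connection between an MoE layer and a sparse, wide MLP concrete, the following is a minimal sketch in plain Python; it is not the paper's implementation. It assumes each expert is a two-layer ReLU MLP and that a router has already produced gate weights, with zeros for unselected experts. The sizes, gate values, and all function names are illustrative assumptions, and the sketch does not model MoE-X's sparsity-aware routing or the sparsity enforced inside each expert.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def moe_forward(x, experts, gates):
    # Standard MoE view: gate-weighted sum of the selected experts' outputs.
    y = [0.0] * len(experts[0][1])
    for (W_in, W_out), g in zip(experts, gates):
        if g == 0.0:
            continue  # unselected expert: never evaluated
        out = matvec(W_out, relu(matvec(W_in, x)))
        y = [yi + g * oi for yi, oi in zip(y, out)]
    return y


def wide_mlp_forward(x, experts, gates):
    # Equivalent view: one wide MLP whose hidden layer concatenates all
    # experts' hidden units, with each expert's block scaled by its gate.
    hidden = len(experts[0][0])
    d_out = len(experts[0][1])
    W_in_all = [row for W_in, _ in experts for row in W_in]
    h_all = relu(matvec(W_in_all, x))
    mask = [g for g in gates for _ in range(hidden)]
    h_sparse = [m * v for m, v in zip(mask, h_all)]
    W_out_all = [[w for _, W_out in experts for w in W_out[r]]
                 for r in range(d_out)]
    return matvec(W_out_all, h_sparse), h_sparse


def make_expert(d, hidden, rng):
    # Hypothetical random expert: W_in is (hidden x d), W_out is (d x hidden).
    W_in = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
    W_out = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]
    return W_in, W_out


rng = random.Random(0)
d, hidden, num_experts = 3, 4, 4
experts = [make_expert(d, hidden, rng) for _ in range(num_experts)]
x = [rng.uniform(-1, 1) for _ in range(d)]
gates = [0.7, 0.0, 0.3, 0.0]  # assumed top-2 gate weights from some router

y_moe = moe_forward(x, experts, gates)
y_wide, h_sparse = wide_mlp_forward(x, experts, gates)
max_diff = max(abs(a - b) for a, b in zip(y_moe, y_wide))
print("max |difference| between the two views:", max_diff)
print("zero hidden units in the wide view:",
      sum(1 for v in h_sparse if v == 0.0), "of", len(h_sparse))
```

In the wide view, the hidden units belonging to unselected experts are exactly zero, which is the sense in which an MoE layer behaves like a large, sparsely activated MLP.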
