

Mixture of Experts Made Intrinsically Interpretable

March 5, 2025
Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
cs.AI

Abstract

Neurons in large language models often exhibit polysemanticity, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present MoE-X, a Mixture-of-Experts (MoE) language model designed to be intrinsically interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.
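To make the routing idea in the abstract concrete, below is a minimal, illustrative sketch of an MoE layer whose router prefers experts with the sparsest (ReLU) activations. This is not the authors' implementation: the sparsity proxy (negative L1/L2 ratio), the top-k value, the tensor shapes, and the fact that all experts are computed before routing (done here only for clarity; the paper describes rewriting the MoE layer as an equivalent sparse MLP to scale efficiently) are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparsityRoutedMoE(nn.Module):
    """Illustrative MoE layer that routes tokens to the experts whose
    hidden activations are sparsest, loosely following the MoE-X description.
    All hyperparameters and the sparsity score are assumptions of this sketch."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small ReLU MLP; ReLU keeps per-expert activations sparse.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_hidden) * d_model ** -0.5)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_hidden, d_model) * d_hidden ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Compute every expert's hidden activation (for clarity only).
        h = F.relu(torch.einsum("td,edh->teh", x, self.w_in))  # (tokens, experts, d_hidden)

        # Score each expert by how sparse its activation is: a simple proxy is the
        # negative L1 mass normalized by the L2 norm (higher score = sparser pattern).
        l1 = h.abs().sum(dim=-1)
        l2 = h.norm(dim=-1) + 1e-6
        sparsity_score = -l1 / l2  # (tokens, experts)

        # Route each token to the top-k experts with the sparsest activations.
        top_scores, top_idx = sparsity_score.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)  # (tokens, top_k)

        # Combine the selected experts' outputs, weighted by the gates.
        h_sel = torch.gather(
            h, 1, top_idx.unsqueeze(-1).expand(-1, -1, h.size(-1))
        )  # (tokens, top_k, d_hidden)
        w_out_sel = self.w_out[top_idx]  # (tokens, top_k, d_hidden, d_model)
        return torch.einsum("tkh,tkhd->td", h_sel * gates.unsqueeze(-1), w_out_sel)
```

As a usage note, `SparsityRoutedMoE(d_model=256, d_hidden=1024, n_experts=8)(x)` on `x` of shape `(num_tokens, 256)` returns an output of the same shape; a production version would avoid evaluating all experts per token, which is the efficiency problem the paper's sparse-MLP rewriting addresses.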
