Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Abstract

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this finding, we propose Large Language Lobotomy (L^3), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L^3 learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L^3 on eight state-of-the-art open-source MoE LLMs and show that adaptive expert silencing increases the average attack success rate from 7.3% to 70.4%, reaching up to 86.3% and outperforming prior training-free MoE jailbreak methods. Moreover, bypassing guardrails typically requires silencing fewer than 20% of the experts in each layer, while general language utility is largely preserved. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment, and they motivate distributing safety mechanisms more robustly in future MoE LLMs through architecture- and routing-aware methods.
