GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs
December 24, 2025
Authors: Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Stjepan Picek, Ahmad-Reza Sadeghi
cs.AI
Abstract
Mixture-of-Experts (MoE) architectures have advanced the scaling of Large Language Models (LLMs) by activating only a sparse subset of parameters per input, enabling state-of-the-art performance with reduced computational cost. As these models are increasingly deployed in critical domains, understanding and strengthening their alignment mechanisms is essential to prevent harmful outputs. However, existing LLM safety research has focused almost exclusively on dense architectures, leaving the unique safety properties of MoEs largely unexamined. The modular, sparsely-activated design of MoEs suggests that safety mechanisms may operate differently than in dense models, raising questions about their robustness.
In this paper, we present GateBreaker, the first training-free, lightweight, and architecture-agnostic attack framework that compromises the safety alignment of modern MoE LLMs at inference time. GateBreaker operates in three stages: (i) gate-level profiling, which identifies safety experts to which harmful inputs are disproportionately routed; (ii) expert-level localization, which pinpoints the safety structure within those experts; and (iii) targeted safety removal, which disables the identified structure to break the model's alignment. Our study shows that MoE safety concentrates in a small subset of neurons coordinated by sparse routing. Selectively disabling these neurons, approximately 3% of the neurons in the targeted expert layers, raises the average attack success rate (ASR) from 7.4% to 64.9% against eight of the latest aligned MoE LLMs, with limited utility degradation. These safety neurons transfer across models within the same family, raising ASR from 17.9% to 67.7% in a one-shot transfer attack. Furthermore, GateBreaker generalizes to five MoE vision-language models (VLMs), achieving 60.9% ASR on unsafe image inputs.
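To make the three stages concrete, the minimal PyTorch sketch below shows one plausible instantiation: routing statistics gathered on harmful versus benign prompts flag safety experts, an activation-gap score inside those experts selects roughly 3% of their neurons, and the corresponding down-projection columns are zeroed. The tensor shapes, scoring heuristics, and function names here are illustrative assumptions, not the paper's released implementation.

```python
import torch

def gate_level_profiling(route_harm, route_benign, top_k=4):
    """Stage (i): flag experts that harmful prompts are routed to far more
    often than benign prompts.

    route_harm, route_benign: [num_tokens, num_experts] gate probabilities
    collected on a harmful and a benign prompt set, respectively.
    """
    routing_gap = route_harm.mean(dim=0) - route_benign.mean(dim=0)
    return torch.topk(routing_gap, k=top_k).indices          # "safety experts"

def expert_level_localization(act_harm, act_benign, ratio=0.03):
    """Stage (ii): inside one flagged expert, score intermediate neurons by how
    much more strongly they fire on harmful inputs, and keep roughly `ratio`
    of them (about 3% in the paper's reported setting).

    act_harm, act_benign: [num_tokens, intermediate_dim] expert activations.
    """
    activation_gap = act_harm.abs().mean(dim=0) - act_benign.abs().mean(dim=0)
    k = max(1, int(ratio * activation_gap.numel()))
    return torch.topk(activation_gap, k=k).indices           # "safety neurons"

def targeted_safety_removal(w_down, safety_neurons):
    """Stage (iii): disable the selected neurons by zeroing the columns of the
    expert's down-projection that read from them, so their contribution never
    reaches the residual stream."""
    with torch.no_grad():
        w_down[:, safety_neurons] = 0.0
    return w_down

# Toy end-to-end run with random stand-ins for real profiling statistics.
torch.manual_seed(0)
num_tokens, num_experts, inter_dim, d_model = 512, 64, 2048, 1024
route_harm = torch.rand(num_tokens, num_experts)
route_benign = torch.rand(num_tokens, num_experts)
safety_experts = gate_level_profiling(route_harm, route_benign)

act_harm = torch.randn(num_tokens, inter_dim)
act_benign = torch.randn(num_tokens, inter_dim)
safety_neurons = expert_level_localization(act_harm, act_benign)

w_down = torch.randn(d_model, inter_dim)  # hypothetical expert down-projection weight
targeted_safety_removal(w_down, safety_neurons)
print(f"experts flagged: {safety_experts.tolist()}, neurons disabled: {safety_neurons.numel()}")
```

In a real attack the routing and activation tensors would come from forward hooks on an actual MoE model rather than random data; the sketch only illustrates how a routing-frequency gap and an activation gap could be turned into expert and neuron selections, and how zeroing a small fraction of down-projection columns disables them at inference time.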