

GateBreaker: Gate-Guided Attacks on Mixture-of-Expert LLMs

December 24, 2025
Authors: Lichao Wu, Sasha Behrouzi, Mohamadreza Rostami, Stjepan Picek, Ahmad-Reza Sadeghi
cs.AI

Abstract

Mixture-of-Experts (MoE) architectures have advanced the scaling of Large Language Models (LLMs) by activating only a sparse subset of parameters per input, enabling state-of-the-art performance with reduced computational cost. As these models are increasingly deployed in critical domains, understanding and strengthening their alignment mechanisms is essential to prevent harmful outputs. However, existing LLM safety research has focused almost exclusively on dense architectures, leaving the unique safety properties of MoEs largely unexamined. The modular, sparsely activated design of MoEs suggests that safety mechanisms may operate differently than in dense models, raising questions about their robustness. In this paper, we present GateBreaker, the first training-free, lightweight, and architecture-agnostic attack framework that compromises the safety alignment of modern MoE LLMs at inference time. GateBreaker operates in three stages: (i) gate-level profiling, which identifies safety experts disproportionately routed on harmful inputs; (ii) expert-level localization, which localizes the safety structure within those experts; and (iii) targeted safety removal, which disables the identified safety structure to compromise safety alignment. Our study shows that MoE safety concentrates within a small subset of neurons coordinated by sparse routing. Selectively disabling these neurons, approximately 3% of the neurons in the targeted expert layers, significantly increases the average attack success rate (ASR) from 7.4% to 64.9% against the eight latest aligned MoE LLMs, with limited utility degradation. These safety neurons transfer across models within the same family, raising ASR from 17.9% to 67.7% with a one-shot transfer attack. Furthermore, GateBreaker generalizes to five MoE vision-language models (VLMs), achieving 60.9% ASR on unsafe image inputs.
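
The three-stage pipeline described in the abstract lends itself to a compact illustration. The following is a minimal NumPy sketch under strong simplifying assumptions: a single toy MoE layer with top-k routing, synthetic stand-ins for harmful and benign prompt representations, and an activation-difference scoring rule for localizing neurons. All identifiers, sizes, and the scoring rule are hypothetical; the sketch only mirrors the structure of gate-level profiling, expert-level localization, and targeted removal, not the paper's actual implementation.

```python
# Minimal, hypothetical sketch of a gate-guided three-stage pipeline on a toy MoE layer.
# Everything here (sizes, scoring rule, identifiers) is assumed for illustration only.

import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS, HIDDEN, FFN, TOP_K = 8, 32, 128, 2

# Toy MoE layer: a linear router plus per-expert feed-forward weights.
W_gate = rng.normal(size=(HIDDEN, NUM_EXPERTS))
W_in = rng.normal(size=(NUM_EXPERTS, HIDDEN, FFN))   # expert up-projections
W_out = rng.normal(size=(NUM_EXPERTS, FFN, HIDDEN))  # expert down-projections

def route(h):
    """Return indices of the top-k experts the gate selects for hidden state h."""
    logits = h @ W_gate
    return np.argsort(logits)[-TOP_K:]

# Synthetic hidden states standing in for harmful / benign prompt representations.
harmful = rng.normal(loc=0.5, size=(200, HIDDEN))
benign = rng.normal(loc=-0.5, size=(200, HIDDEN))

# --- Stage 1: gate-level profiling ------------------------------------------
# Measure how often each expert is routed on harmful vs. benign inputs and flag
# experts that are disproportionately selected on harmful prompts.
def routing_freq(batch):
    counts = np.zeros(NUM_EXPERTS)
    for h in batch:
        counts[route(h)] += 1
    return counts / len(batch)

gap = routing_freq(harmful) - routing_freq(benign)
safety_experts = np.argsort(gap)[-2:]  # flag the two most over-routed experts (toy choice)

# --- Stage 2: expert-level localization --------------------------------------
# Inside each flagged expert, score neurons by their mean ReLU activation gap on
# harmful vs. benign inputs (one plausible scoring rule, assumed here).
def neuron_scores(expert, pos, neg):
    act_pos = np.maximum(pos @ W_in[expert], 0).mean(axis=0)
    act_neg = np.maximum(neg @ W_in[expert], 0).mean(axis=0)
    return act_pos - act_neg

budget = int(0.03 * FFN)  # ~3% of neurons per targeted expert, echoing the abstract
safety_neurons = {
    e: np.argsort(neuron_scores(e, harmful, benign))[-budget:] for e in safety_experts
}

# --- Stage 3: targeted safety removal ----------------------------------------
# Disable the identified neurons by zeroing their weights, removing their
# contribution at inference time without any retraining.
for e, idx in safety_neurons.items():
    W_in[e][:, idx] = 0.0
    W_out[e][idx, :] = 0.0

print("flagged experts:", safety_experts.tolist())
print("disabled neurons:", {int(e): idx.tolist() for e, idx in safety_neurons.items()})
```

Note the design point this sketch mirrors: the router itself is left untouched, and only the contribution of a small set of expert neurons is zeroed out, which is what makes such an attack training-free and applicable at inference time.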