Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

Abstract

The rapid adoption of Mixture-of-Experts (MoE) architectures marks a major shift in the deployment of Large Language Models (LLMs). MoE LLMs improve scaling efficiency by activating only a small subset of parameters per token, but their routing structure introduces new safety attack surfaces. We find that safety-critical behaviors in MoE LLMs (e.g., refusal) are concentrated in a small set of experts rather than being uniformly distributed. Building on this, we propose Large Language Lobotomy (L^3), a training-free, architecture-agnostic attack that compromises safety alignment by exploiting expert routing dynamics. L^3 learns routing patterns that correlate with refusal, attributes safety behavior to specific experts, and adaptively silences the most safety-relevant experts until harmful outputs are produced. We evaluate L^3 on eight state-of-the-art open-source MoE LLMs and show that adaptive expert silencing outperforms prior training-free MoE jailbreak methods, raising the average attack success rate from 7.3% to 70.4% and reaching up to 86.3%. Moreover, bypassing guardrails typically requires silencing fewer than 20% of layer-wise experts while largely preserving general language utility. These results reveal a fundamental tension between efficiency-driven MoE design and robust safety alignment, and they motivate distributing safety mechanisms more robustly in future MoE LLMs through architecture- and routing-aware methods.
