MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Abstract

Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Because only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a mechanism that is difficult to control and can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, with gains of up to 89.2%. For adult-content generation, MASCing enables models to comply with requests that they would otherwise refuse, increasing the average generation success rate from 52.6% to 82.0%, with gains of up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
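
To make the inference-time step concrete, here is a minimal sketch of how a steering mask could change which experts a router selects. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how the mask is combined with the gate, so the sketch assumes an additive mask on the router logits followed by top-k selection and a softmax over the selected experts. The function name `top_k_route` and all numeric values are hypothetical.

```python
import math


def top_k_route(logits, k, mask=None):
    """Pick k experts from router logits, optionally after adding a steering mask.

    Assumption: the mask is additive on the logits. Returns the selected
    expert indices (highest steered logit first) and their softmax weights.
    """
    if mask is None:
        mask = [0.0] * len(logits)
    if len(mask) != len(logits):
        raise ValueError("mask and logits must have the same length")

    # Apply the steering mask before expert selection.
    steered = [logit + m for logit, m in zip(logits, mask)]

    # Indices of the k largest steered logits (ties go to the lower index).
    chosen = sorted(range(len(steered)), key=lambda i: (-steered[i], i))[:k]

    # Softmax over the selected logits only, shifted for numerical stability.
    peak = max(steered[i] for i in chosen)
    exps = [math.exp(steered[i] - peak) for i in chosen]
    total = sum(exps)
    weights = [e / total for e in exps]
    return chosen, weights


# Hypothetical router logits for one token in a layer with 4 experts.
logits = [2.0, 1.0, 0.5, -1.0]
# Hypothetical mask: strongly suppress expert 0, boost expert 3.
mask = [-10.0, 0.0, 0.0, 3.0]

base_experts, _ = top_k_route(logits, k=2)
steered_experts, steered_weights = top_k_route(logits, k=2, mask=mask)
print(base_experts, steered_experts)
```

With these made-up values, the unsteered router selects experts 0 and 1, while the masked router selects experts 3 and 1. The model weights are untouched; only the selection changes, which is the sense in which such steering avoids retraining.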
