MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
April 30, 2026
Authors: Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin, Stjepan Picek
cs.AI
Abstract
Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) significantly reduce inference costs through sparse activation, but this sparsity also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a mechanism that is difficult to control and can shift across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, reaching up to 89.2%. For adult-content generation, MASCing enables models to comply with requests that would otherwise be refused, raising the average generation success rate from 52.6% to 82.0%, reaching up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
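The abstract's core mechanism is the inference-time steering mask applied to the routing gates. The paper's actual implementation is not given here; below is a minimal sketch of how an additive mask over gate logits could override top-k expert selection in a sparse MoE layer. The function name masked_route, the tensor shapes, and the additive form of the mask are illustrative assumptions, not the authors' code.

import torch

def masked_route(router_logits: torch.Tensor,
                 steering_mask: torch.Tensor,
                 top_k: int = 2) -> torch.Tensor:
    """Hypothetical illustration of steering-mask routing.

    router_logits: (batch, num_experts) gate scores from one MoE layer.
    steering_mask: (num_experts,) additive bias; positive entries promote
        behavior-relevant experts, large negative entries suppress them.
    Returns a (batch, num_experts) sparse weight matrix over experts.
    """
    # Bias the gate before expert selection; the mask changes which
    # experts win the top-k competition without touching model weights.
    steered = router_logits + steering_mask

    # Standard sparse MoE dispatch: keep top-k experts, renormalize.
    topk_vals, topk_idx = steered.topk(top_k, dim=-1)
    weights = torch.zeros_like(steered)
    weights.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return weights

# Example: suppress expert 0, promote expert 3 on a toy 8-expert layer.
logits = torch.randn(1, 8)
mask = torch.zeros(8)
mask[0], mask[3] = -1e9, 5.0
print(masked_route(logits, mask))

In this sketch, a large negative entry effectively removes an expert from the top-k competition, while a positive entry promotes it; per the abstract, MASCing's optimized steering matrix would supply such per-layer values to enhance or suppress the targeted behavior.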