MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
April 30, 2026
Authors: Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin, Stjepan Picek
cs.AI
Abstract
Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, reaching up to 89.2%. For adult-content generation, MASCing enables models to comply with requests that would otherwise be refused, increasing the average generation success rate from 52.6% to 82.0%, reaching up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
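The core inference-time mechanism, as the abstract describes it, is applying a steering mask to the routing gates before expert selection. The sketch below illustrates one plausible reading of that idea: an additive bias on the per-token router logits that suppresses or promotes individual experts before top-k selection. All names, the additive-bias scheme, and the single-token setting are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: the function name, the additive-bias masking
# scheme, and the single-token setup are assumptions, not MASCing's code.
import numpy as np

def masked_topk_routing(router_logits, steering_mask, k=2):
    """Apply an additive steering mask to MoE router logits, then pick top-k experts.

    router_logits: (num_experts,) raw gate scores for one token.
    steering_mask: (num_experts,) additive bias; a large negative entry
                   effectively removes an expert from selection, a positive
                   entry promotes it (hypothetical scheme).
    """
    steered = router_logits + steering_mask
    # Indices of the k largest steered scores, highest first.
    topk = np.argsort(steered)[-k:][::-1]
    # Softmax over the selected experts to get gate weights.
    weights = np.exp(steered[topk] - steered[topk].max())
    weights /= weights.sum()
    return topk, weights

logits = np.array([1.0, 0.5, 2.0, -0.3])
mask = np.array([0.0, 0.0, -10.0, 3.0])  # suppress expert 2, promote expert 3
experts, w = masked_topk_routing(logits, mask)
print(experts.tolist())  # -> [3, 0]: expert 2, the unmasked top choice, is no longer selected
```

Because the mask only biases the gate scores, the experts themselves are untouched, which is consistent with the abstract's claim of negligible overhead and preserved general utility: swapping in a different mask reconfigures behavior without retraining.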