MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks

Abstract

Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, with gains of up to 89.2%. For adult-content generation, MASCing enables models to comply with such requests that would otherwise be refused, increasing the average generation success rate from 52.6% to 82.0%, with gains of up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
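The inference-time step described above, applying a steering mask to the routing gates so that expert selection changes, can be illustrated with a minimal sketch. The abstract does not specify the mask's form, so this sketch assumes an additive mask on the router logits, followed by standard top-k selection and a softmax over the selected experts. The function name `top_k_route`, the logits, and the mask values are all hypothetical and are not taken from the paper.

```python
import math


def top_k_route(logits, k, mask=None):
    """Pick the top-k experts from router logits.

    If a steering mask is given, it is added to the logits first
    (assumption: additive mask; the paper's exact form may differ).
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    if mask is not None:
        logits = [l + m for l, m in zip(logits, mask)]
    # Indices of the k largest (possibly steered) logits.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# Hypothetical router logits for one token over four experts.
router_logits = [2.0, 1.5, 0.3, -0.4]
# Hypothetical mask: suppress expert 0, promote expert 2.
steering_mask = [-5.0, 0.0, 3.0, 0.0]

baseline = top_k_route(router_logits, k=2)
steered = top_k_route(router_logits, k=2, mask=steering_mask)

print([i for i, _ in baseline])  # experts selected without steering
print([i for i, _ in steered])   # experts selected with the mask applied
```

In this toy setting the mask swaps expert 0 out of the selected set and expert 2 in, while the model weights are left untouched. This is the sense in which such a mask could reconfigure behavior without retraining.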
