Focus on What Matters: Saliency-Harnessing Accurate Routing in Diffusion MoE

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling diffusion models for visual generation. Recent work has focused on adaptively allocating compute across tokens to improve efficiency and performance. However, we identify a routing assignment problem in existing diffusion MoE frameworks: the router fails to allocate more compute to salient tokens. Our analysis attributes this failure to the router's reliance on noise-corrupted latent features throughout the denoising process. This stochastic noise obscures critical structural and textural information and prevents the router from distinguishing salient tokens. To address this, we propose SharpMoE, a post-training framework with a saliency-harnessing accurate routing mechanism that uses clean latent features as a noise-free guidance signal for routing. By bypassing the noise-distorted inputs, SharpMoE gives the router clear saliency guidance, so salient tokens can be identified even at high-noise stages. Furthermore, we introduce a trajectory routing loss that constrains compute allocation throughout the multi-step denoising trajectory, ensuring precise resource allocation along the generation rollout. Extensive experiments demonstrate that SharpMoE is a versatile, plug-and-play solution that further improves pretrained, converged MoE models and achieves state-of-the-art performance in visual generation.
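
The routing problem described above can be illustrated with a toy experiment. The sketch below is not the SharpMoE implementation: the linear router, the synthetic "salient" tokens, the function names, and the forward-noising form x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps are all assumptions chosen for illustration. It only compares which tokens a top-k router selects when it scores noisy latents versus clean latents, and it does not model the trajectory routing loss or how a clean guidance signal would be obtained in practice.

```python
import numpy as np


def topk_tokens(scores, k):
    """Return the indices of the k highest-scoring tokens as a set."""
    return set(np.argsort(-scores)[:k].tolist())


def routing_recall(alpha_bar, guidance="noisy", n_tokens=64, dim=16, k=8, seed=0):
    """Fraction of salient tokens that a toy top-k router selects.

    Hypothetical setup (not from the paper): the first k tokens are
    "salient" because their clean features carry a strong component
    along the router direction w.
    """
    rng = np.random.default_rng(seed)

    # Toy linear router: one unit-norm scoring direction.
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)

    # Clean latents: weak background plus a strong signal on salient tokens.
    x0 = 0.1 * rng.standard_normal((n_tokens, dim))
    x0[:k] += 3.0 * w

    # Assumed forward-noising step; a small alpha_bar means a high-noise stage.
    eps = rng.standard_normal((n_tokens, dim))
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

    # The router scores either the noisy latent or the clean guidance signal.
    router_input = x0 if guidance == "clean" else x_t
    selected = topk_tokens(router_input @ w, k)

    # Reference saliency: the top-k tokens under the clean scores.
    salient = topk_tokens(x0 @ w, k)
    return len(selected & salient) / k


if __name__ == "__main__":
    for alpha_bar in (1.0, 0.5, 0.1, 0.01):
        noisy = routing_recall(alpha_bar, guidance="noisy")
        clean = routing_recall(alpha_bar, guidance="clean")
        print(f"alpha_bar={alpha_bar:<5} noisy recall={noisy:.2f} clean recall={clean:.2f}")
```

In this toy setting the clean-guided router recovers the salient set at every noise level by construction, while the noise-driven router loses part of it once the noise term dominates the signal. The numbers say nothing about real diffusion MoE models; they only make the mechanism in the abstract concrete.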