

Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

November 15, 2025
Authors: Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber
cs.AI

Abstract

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities through flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising-timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. The router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design on text-to-image generation (MoS-Image) and editing (MoS-Editing), both achieving state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4× larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
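The abstract describes the router only at a high level. Below is a minimal, hypothetical sketch of what a token-wise, timestep-conditioned top-k router with ε-greedy exploration could look like; the class name `TokenRouter`, all shapes, and the scoring/fusion details are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sketch of token-wise top-k routing between two modalities' hidden states.
# All names, shapes, and design choices here are illustrative assumptions, not the
# paper's actual architecture or training recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenRouter(nn.Module):
    """Scores candidate hidden states per token and mixes in only the top-k of them."""

    def __init__(self, dim: int, num_candidates: int, k: int = 2, epsilon: float = 0.1):
        super().__init__()
        self.k = k
        self.epsilon = epsilon  # exploration rate for epsilon-greedy routing during training
        # Lightweight scorer conditioned on the token state and the denoising-timestep embedding.
        self.score = nn.Linear(2 * dim, num_candidates)

    def forward(self, token_states, candidate_states, timestep_emb):
        # token_states:     (batch, tokens, dim)      -- states of the modality being updated
        # candidate_states: (batch, candidates, dim)  -- hidden states of the other modality
        # timestep_emb:     (batch, dim)              -- denoising-timestep embedding
        b, t, d = token_states.shape
        cond = torch.cat([token_states, timestep_emb[:, None, :].expand(b, t, d)], dim=-1)
        logits = self.score(cond)  # (batch, tokens, candidates)

        if self.training and self.epsilon > 0:
            # epsilon-greedy: with probability epsilon a token routes to random candidates,
            # keeping rarely selected states reachable during training.
            explore = torch.rand(b, t, 1, device=logits.device) < self.epsilon
            logits = torch.where(explore, torch.rand_like(logits), logits)

        topk_vals, topk_idx = logits.topk(self.k, dim=-1)  # sparse selection, (batch, tokens, k)
        weights = F.softmax(topk_vals, dim=-1)

        # Gather the selected candidate states and mix them per token.
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, d)                # (batch, tokens, k, dim)
        selected = candidate_states.unsqueeze(1).expand(b, t, -1, -1).gather(2, idx)
        mixed = (weights.unsqueeze(-1) * selected).sum(dim=2)             # (batch, tokens, dim)
        return token_states + mixed  # fuse routed context back into the token stream


# Toy usage with placeholder shapes:
router = TokenRouter(dim=64, num_candidates=16, k=2)
img = torch.randn(2, 128, 64)    # e.g. image-token states
txt = torch.randn(2, 16, 64)     # e.g. text hidden states to route from
t_emb = torch.randn(2, 64)       # timestep embedding
fused = router(img, txt, t_emb)  # (2, 128, 64)
```

Because only a small linear scorer is learned and only k candidate states are gathered per token, a router of this shape adds few parameters and little compute, which is consistent with the efficiency claim in the abstract.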