

Mixture of States: Routing Token-Level Dynamics for Multimodal Generation

November 15, 2025
Authors: Haozhe Liu, Ding Liu, Mingchen Zhuge, Zijian Zhou, Tian Xie, Sen He, Yukang Yang, Shuming Liu, Yuren Cong, Jiadong Guo, Hongyu Xu, Ke Xu, Kam-Woh Ng, Juan C. Pérez, Juan-Manuel Pérez-Rúa, Tao Xiang, Wei Liu, Shikun Liu, Jürgen Schmidhuber
cs.AI

Abstract

We introduce MoS (Mixture of States), a novel fusion paradigm for multimodal diffusion models that merges modalities using flexible, state-based interactions. The core of MoS is a learnable, token-wise router that creates denoising-timestep- and input-dependent interactions between modalities' hidden states, precisely aligning token-level features with the diffusion trajectory. This router sparsely selects the top-k hidden states and is trained with an ε-greedy strategy, efficiently selecting contextual features with minimal learnable parameters and negligible computational overhead. We validate our design with text-to-image generation (MoS-Image) and editing (MoS-Editing), which achieve state-of-the-art results. With only 3B to 5B parameters, our models match or surpass counterparts up to 4× larger. These findings establish MoS as a flexible and compute-efficient paradigm for scaling multimodal diffusion models.
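The routing mechanism described in the abstract (a token-wise router that sparsely picks the top-k hidden states from another modality, with ε-greedy exploration during training) can be illustrated with a minimal NumPy sketch. Everything here is hypothetical: the function name `mos_router`, the bilinear scoring via a projection `W`, and the softmax-weighted fusion of the selected states are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def mos_router(token_states, context_states, W, k=2, epsilon=0.0, rng=None):
    """Illustrative token-wise top-k router (not the paper's exact design).

    token_states:   (T, d) hidden states of the target modality
    context_states: (S, d) candidate hidden states from another modality
    W:              (d, d) learnable routing projection (assumed bilinear scoring)
    k:              number of context states selected per token
    epsilon, rng:   epsilon-greedy exploration used only during training
    Returns the fused states (T, d) and the chosen indices (T, k).
    """
    scores = token_states @ W @ context_states.T          # (T, S) affinity logits
    T, S = scores.shape
    idx = np.argsort(-scores, axis=1)[:, :k]              # sparse top-k per token
    if rng is not None and epsilon > 0.0:
        # epsilon-greedy: with prob. epsilon, replace a token's picks at random
        explore = rng.random(T) < epsilon
        rand_idx = rng.integers(0, S, size=(T, k))
        idx = np.where(explore[:, None], rand_idx, idx)
    sel = np.take_along_axis(scores, idx, axis=1)         # (T, k) selected logits
    w = np.exp(sel - sel.max(axis=1, keepdims=True))      # stable softmax weights
    w /= w.sum(axis=1, keepdims=True)
    fused = np.einsum("tk,tkd->td", w, context_states[idx])
    return token_states + fused, idx                      # residual-style merge

# Usage sketch: route 4 image tokens over 6 text hidden states of width 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
c = rng.standard_normal((6, 8))
out, picked = mos_router(x, c, np.eye(8), k=2)
```

The per-token cost is one `(T, S)` score matrix plus a k-way weighted sum, which matches the abstract's claim of minimal learnable parameters (just `W` here) and negligible overhead relative to full cross-attention over all `S` states.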
December 2, 2025