

The Design Space of Tri-Modal Masked Diffusion Models

February 25, 2026
作者: Louis Bethune, Victor Turrisi, Bruno Kacper Mlodozeniec, Pau Rodriguez Lopez, Lokesh Boominathan, Nikhil Bhendawade, Amitis Shidani, Joris Pelemans, Theo X. Olausson, Devon Hjelm, Paul Dixon, Joao Monteiro, Pierre Ablin, Vishnu Banna, Arno Blaas, Nick Henderson, Kari Noriy, Dan Busbridge, Josh Susskind, Marco Cuturi, Irina Belousova, Luca Zappella, Russ Webb, Jason Ramapuram
cs.AI

Abstract

Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
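To make the "masked diffusion" setup concrete, the sketch below shows the standard absorbing-state forward process such models are built on: each token is independently replaced by a special mask token with a probability set by the noise schedule. This is a minimal illustration of the general technique, not code from the paper; the names `MASK_ID`, `alpha`, and the linear schedule are illustrative assumptions.

```python
import numpy as np

MASK_ID = -1  # hypothetical id for the absorbing mask token

def alpha(t):
    """Illustrative linear noise schedule: fraction of tokens kept at time t in [0, 1]."""
    return 1.0 - t

def forward_mask(tokens, t, rng):
    """Absorbing forward process: each token is independently replaced
    by MASK_ID with probability 1 - alpha(t)."""
    keep = rng.random(tokens.shape) < alpha(t)
    return np.where(keep, tokens, MASK_ID)

rng = np.random.default_rng(0)
x0 = np.arange(10)
xt = forward_mask(x0, t=0.5, rng=rng)  # roughly half the tokens are masked
```

At t = 0 no tokens are masked and at t = 1 every token is absorbed into `MASK_ID`; the model is then trained to predict the original tokens at the masked positions, and choices such as the schedule `alpha` are among the design-space axes the abstract refers to.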