Sumi: Open Uniform Diffusion Language Model from Scratch
June 17, 2026
Authors: Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki
cs.AI
Abstract
Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.