Sumi: An Open-Source Uniform Diffusion Language Model Built from Scratch

Abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open-source 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while underperforming on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.