Sumi: An Open Uniform Diffusion Language Model Built from Scratch

Abstract

Diffusion models have emerged as a promising alternative to autoregressive models. Among them, uniform diffusion language models (UDLMs) allow any token to be updated at any step, which in principle enables more flexible generation. However, no UDLM has yet been pretrained from scratch at both a large parameter scale and a large token budget. Autoregressive modeling and masked diffusion modeling both already have capable models at scale that the community can study and build on; uniform diffusion has none. A UDLM pretrained from scratch at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B-parameter uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, but underperforms on commonsense reasoning benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its still poorly understood aspects.