Sumi: An Open Uniform Diffusion Language Model Built from Scratch

Abstract

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.
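
The contrast between uniform and masked diffusion drawn in the abstract can be illustrated with a toy forward-corruption step. The following is a minimal sketch, not Sumi's training code: the function names, the `MASK_ID` sentinel, and the assumption that each position is corrupted independently with probability `noise_level` are all hypothetical choices made for illustration.

```python
import random

MASK_ID = -1  # hypothetical sentinel id, used only for the masked-diffusion comparison


def corrupt_uniform(tokens, noise_level, vocab_size, rng):
    """Uniform corruption: with probability noise_level, replace a token
    with one drawn uniformly from the whole vocabulary (assumed scheme)."""
    return [
        rng.randrange(vocab_size) if rng.random() < noise_level else tok
        for tok in tokens
    ]


def corrupt_masked(tokens, noise_level, rng):
    """Masked corruption: with probability noise_level, replace a token
    with the dedicated mask id (assumed scheme)."""
    return [MASK_ID if rng.random() < noise_level else tok for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [5, 17, 42, 8, 99, 3]
    print("uniform:", corrupt_uniform(clean, 0.5, vocab_size=100, rng=rng))
    print("masked: ", corrupt_masked(clean, 0.5, rng=rng))
```

In the masked case a corrupted position is always recognizable by its mask id, so a denoiser only needs to fill in those positions. In the uniform case a corrupted position holds an ordinary vocabulary token, so the denoiser has to be prepared to revise any position, which is the property the abstract describes as permitting any token to be updated at any step.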