MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models

Abstract

Masked diffusion models (MDM) exhibit superior generalization when trained with a partial masking scheme (Prime). This approach converts tokens into sub-tokens and models the diffusion process at the sub-token level. We identify two limitations of the MDM-Prime framework. First, there are no tools to guide the choice of the hyperparameter that sets the token granularity of the subtokenizer. Second, we find that the functional form of the subtokenizer significantly degrades likelihood estimation when paired with commonly used Byte-Pair-Encoding (BPE) tokenizers. To address these limitations, we study the tightness of the variational bound in MDM-Prime and develop MDM-Prime-v2, a masked diffusion language model that incorporates Binary Encoding and Index Shuffling. Our scaling analysis reveals that MDM-Prime-v2 is 21.8× more compute-efficient than autoregressive models (ARM). In compute-optimal comparisons, MDM-Prime-v2 achieves a perplexity of 7.77 on OpenWebText, outperforming ARM (12.99), MDM (18.94), and MDM-Prime (13.41). When the model is scaled to 1.1B parameters, it further demonstrates superior zero-shot accuracy on various commonsense reasoning tasks.
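The abstract names Binary Encoding and Index Shuffling without spelling out their mechanics, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that index shuffling means a fixed random permutation of token ids and that binary encoding means writing the permuted id as a fixed-length bit sequence. The function name `make_subtokenizer`, the `seed` parameter, and the vocabulary size of 50257 (the GPT-2 BPE vocabulary) are all illustrative assumptions.

```python
import random


def make_subtokenizer(vocab_size, seed=0):
    """Build a hypothetical binary subtokenizer with index shuffling.

    Assumption: each token id is first mapped through a fixed random
    permutation (index shuffling) and then written as num_bits binary
    sub-tokens (binary encoding). This is an illustration only.
    """
    # Smallest number of bits that can represent every id in [0, vocab_size).
    num_bits = max(1, (vocab_size - 1).bit_length())

    # Fixed, seeded permutation of the token ids and its inverse.
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    inverse = [0] * vocab_size
    for original, shuffled in enumerate(perm):
        inverse[shuffled] = original

    def encode(token_id):
        # Token id -> shuffled id -> list of bits, most significant first.
        shuffled = perm[token_id]
        return [(shuffled >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]

    def decode(bits):
        # List of bits -> shuffled id -> original token id.
        # Bit patterns that encode a value >= vocab_size are invalid and
        # raise IndexError in this sketch.
        shuffled = 0
        for bit in bits:
            shuffled = (shuffled << 1) | bit
        return inverse[shuffled]

    return encode, decode, num_bits


encode, decode, num_bits = make_subtokenizer(50257)
sub_tokens = encode(12345)
recovered = decode(sub_tokens)
```

Under these assumptions each token becomes 16 binary sub-tokens, and a sub-token-level diffusion process would mask individual bits rather than whole tokens. The permutation is bijective, so decoding a fully unmasked bit sequence recovers the original token id.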
