Diffutron: A Masked Diffusion Language Model for Turkish Language

Abstract

Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce Diffutron, a masked diffusion language model specifically designed for Turkish. Our approach uses a resource-efficient training pipeline that starts with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and then task-specific instruction sets. Experimental results across comprehensive benchmarks show that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings support the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.
