Diffutron: A Masked Diffusion Language Model for Turkish Language

Abstract

Masked Diffusion Language Models (MDLMs) have emerged as a compelling non-autoregressive alternative to standard large language models; however, their application to morphologically rich languages remains limited. In this paper, we introduce Diffutron, a masked diffusion language model specifically designed for Turkish. Our approach leverages a resource-efficient training pipeline, starting with LoRA-based continual pre-training of a multilingual encoder on a large-scale corpus. To enable generative capabilities, we employ a progressive instruction-tuning strategy, sequentially adapting the model on general and task-specific instruction sets. Experimental results across comprehensive benchmarks demonstrate that, despite its compact size, our model achieves competitive performance compared to existing multi-billion-parameter baselines. These findings validate the effectiveness of masked diffusion modeling combined with multi-stage tuning for non-autoregressive text generation in Turkish.
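
As a rough illustration of the masked diffusion idea mentioned above, the following minimal sketch shows a forward masking step and an iterative unmasking loop. It is not Diffutron's implementation: the abstract does not specify the noising schedule or the decoding order, so the per-token masking probability `t`, the left-to-right fill order, the `[MASK]` symbol, and the `predict` callback (a stand-in for the trained denoiser) are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def forward_mask(tokens, t, rng):
    """Forward (noising) step: mask each token independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


def unmask_step(tokens, predict, k):
    """Reverse step: fill in up to k masked positions.

    `predict(tokens, i)` is a placeholder for a denoising model that
    proposes a token for position i given the partially masked sequence.
    Positions are filled left to right here purely for simplicity.
    """
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    out = list(tokens)
    for i in masked[:k]:
        out[i] = predict(tokens, i)
    return out


def generate(length, predict, steps):
    """Start from an all-mask sequence and unmask it over `steps` rounds."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    tokens = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    for _ in range(steps):
        tokens = unmask_step(tokens, predict, per_step)
    return tokens
```

Because several positions are filled per round, generation takes `steps` rounds rather than one round per token, which is the sense in which such a model is non-autoregressive. A practical sampler would typically choose which positions to unmask from model confidence rather than position order.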
