

Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain

January 22, 2026
Authors: Özgür Uğur, Mahmut Göksu, Mahmut Çimen, Musa Yılmaz, Esra Şavirdi, Alp Talha Demir, Rumeysa Güllüce, İclal Çetin, Ömer Can Sağbaş
cs.AI

Abstract

This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that the optimal checkpoints reach their best retrieval scores before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) matching the performance of larger reference models (307M-567M parameters). Our approach achieves 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall despite requiring fewer computational resources. SOTA models rely on multi-stage, computationally intensive training pipelines, which makes our single-stage pre-training followed by efficient post-training a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. Four-phase CPT with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning. This approach achieves a 36.2% perplexity reduction on Turkish legal text, demonstrating the gains from domain adaptation.
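
The checkpoint selection strategy described in the abstract amounts to scoring intermediate checkpoints on a downstream retrieval benchmark and keeping the best-scoring one, rather than the checkpoint with the lowest pre-training loss. A minimal sketch of that idea follows; the checkpoint names, scores, and the `eval_retrieval` callback are hypothetical placeholders, not the authors' actual evaluation pipeline.

```python
# Minimal sketch: pick the checkpoint with the best downstream retrieval score,
# not the one with minimum pre-training loss. All names and numbers are
# illustrative assumptions, not values from the paper.

def select_checkpoint(checkpoints, eval_retrieval):
    """Return the checkpoint whose retrieval score is highest."""
    best_ckpt, best_score = None, float("-inf")
    for ckpt in checkpoints:
        score = eval_retrieval(ckpt)  # e.g. nDCG@10 on a held-out Turkish retrieval set
        if score > best_score:
            best_ckpt, best_score = ckpt, score
    return best_ckpt, best_score

if __name__ == "__main__":
    # Toy example: retrieval quality peaks at step 60k even if loss keeps falling.
    toy_scores = {"step-40k": 0.61, "step-60k": 0.68, "step-80k": 0.66}
    ckpt, score = select_checkpoint(toy_scores, toy_scores.get)
    print(ckpt, score)  # step-60k 0.68
```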
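
The reported 36.2% figure is a relative perplexity reduction. The snippet below only illustrates how such a number is computed; the absolute perplexity values are invented for the example and do not come from the paper.

```python
# Illustration of a relative perplexity reduction; the values are hypothetical.
ppl_base = 10.0     # assumed perplexity of the base model on Turkish legal text
ppl_adapted = 6.38  # assumed perplexity after continual pre-training

reduction = (ppl_base - ppl_adapted) / ppl_base
print(f"{reduction:.1%}")  # 36.2%
```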