Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
April 12, 2026
Authors: Krzysztof Ociepa, Łukasz Flis, Remigiusz Kinas, Krzysztof Wróbel, Adrian Gwoździej
cs.AI
Abstract
The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
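The "fertility ratio" mentioned above is the average number of tokens a tokenizer emits per word; a lower ratio on Polish text means cheaper inference and a longer effective context window. A minimal sketch of how such a measurement works is below. The corpus and the fixed-length chunk tokenizer are hypothetical stand-ins for illustration only; the paper's actual comparison is between the Mistral tokenizer and the Bielik v3 Polish-optimized vocabulary.

```python
# Sketch: measuring tokenizer "fertility" (tokens per whitespace word).
# chunk_tokenizer and the corpus below are illustrative stand-ins, not
# the tokenizers or data used in the Bielik v3 report.

def fertility(tokenize, corpus):
    """Average number of tokens produced per whitespace-separated word."""
    words = [w for line in corpus for w in line.split()]
    tokens = sum(len(tokenize(w)) for w in words)
    return tokens / len(words)

def chunk_tokenizer(word, size=3):
    # Crude stand-in for a subword tokenizer: fixed-length character chunks.
    return [word[i:i + size] for i in range(0, len(word), size)]

corpus = ["najpiękniejszego", "przyszłość języka polskiego"]
print(round(fertility(chunk_tokenizer, corpus), 2))  # 15 tokens / 4 words = 3.75
```

A language-specific tokenizer lowers this number by storing frequent Polish morphemes and whole words as single vocabulary entries.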
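When the vocabulary is swapped, the new tokens have no pretrained embeddings. The FOCUS approach referenced above initializes them from the old embedding table: tokens shared by both vocabularies are copied directly, while genuinely new tokens get a similarity-weighted combination of overlapping tokens' embeddings. The sketch below is a simplified illustration of that idea under assumptions: real FOCUS derives its similarity weights from auxiliary fastText vectors trained on the target language, whereas here random vectors stand in for them, and the tiny vocabularies are invented.

```python
# Simplified FOCUS-style embedding initialization for a new vocabulary.
# The auxiliary vectors (aux) stand in for fastText embeddings; the
# vocabularies and dimensions are toy values for illustration.
import numpy as np

rng = np.random.default_rng(0)

old_vocab = {"the": 0, "pol": 1, "ish": 2}      # source-model vocabulary
old_emb = rng.normal(size=(3, 4))               # pretrained embedding table
new_vocab = {"the": 0, "polish": 1}             # target (Polish-optimized) vocab

# Stand-in for auxiliary static embeddings of every token.
aux = {t: rng.normal(size=8) for t in set(old_vocab) | set(new_vocab)}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

overlap = [t for t in new_vocab if t in old_vocab]
new_emb = np.zeros((len(new_vocab), old_emb.shape[1]))
for tok, i in new_vocab.items():
    if tok in old_vocab:
        # Shared token: reuse the pretrained embedding unchanged.
        new_emb[i] = old_emb[old_vocab[tok]]
    else:
        # New token: mix overlapping tokens' embeddings, weighted by
        # (clipped, normalized) auxiliary-space similarity.
        sims = np.array([max(cos(aux[tok], aux[o]), 0.0) for o in overlap])
        w = sims / sims.sum() if sims.sum() > 0 else np.full(len(overlap), 1 / len(overlap))
        new_emb[i] = sum(wi * old_emb[old_vocab[o]] for wi, o in zip(w, overlap))
```

This gives the new vocabulary a warm start, so the multi-stage pretraining curriculum adapts the embeddings rather than learning them from scratch.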