Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Abstract

The Bielik v3 PL series, comprising 7B and 11B parameter variants, marks a significant milestone in language-specific large language model (LLM) optimization. General-purpose models often show impressive multilingual capabilities, but they commonly inherit a fundamental inefficiency: a universal tokenizer. Such tokenizers are designed to cover a broad range of languages and often fail to capture the morphology of a specific language such as Polish. The result is higher fertility (more tokens per word), higher inference cost, and a smaller effective context window. This report details the transition from the universal Mistral-based tokenizer to a dedicated Polish-optimized vocabulary for the Bielik v3 models. It covers the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the post-training alignment, which consists of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning via Group Relative Policy Optimization (GRPO) with verifiable rewards.
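To make the fertility argument concrete, the following is a minimal sketch of how fertility (average tokens per whitespace-separated word) can be measured. It is an illustration only: the greedy longest-match splitter stands in for a real subword tokenizer, and the two vocabularies are hypothetical toy sets, not the Mistral or Bielik v3 vocabularies.

```python
def greedy_tokenize(word, vocab):
    """Split a word by greedy longest-match against vocab.

    Characters not covered by any vocabulary entry become
    single-character tokens (a crude stand-in for byte fallback).
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, falling back to one character.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


def fertility(text, vocab):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    n_tokens = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return n_tokens / len(words)


# Hypothetical toy vocabularies (assumptions for illustration only).
generic_vocab = {"model", "pro", "gram"}
polish_vocab = generic_vocab | {"polski", "język", "owy"}

text = "polski model językowy"

generic_fertility = fertility(text, generic_vocab)
polish_fertility = fertility(text, polish_vocab)
print(f"generic vocabulary: {generic_fertility:.2f} tokens per word")
print(f"Polish vocabulary:  {polish_fertility:.2f} tokens per word")
```

With the Polish-aware vocabulary the same text splits into fewer tokens, so more words fit into a fixed context window and fewer decoding steps are needed per word. That is the effect the dedicated vocabulary targets.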

