Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Abstract

The development of the Bielik v3 PL series, comprising 7B and 11B parameter variants, represents a significant milestone in language-specific large language model (LLM) optimization. General-purpose models often demonstrate impressive multilingual capabilities, but they frequently suffer from a fundamental structural inefficiency: the use of universal tokenizers. These tokenizers are typically designed to cover a broad spectrum of languages and often fail to capture the morphological nuances of a specific language such as Polish. The result is a higher fertility ratio (more tokens per word), increased inference cost, and a smaller effective context window. This report details the transition from universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models. It covers the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment, which involves Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning through Group Relative Policy Optimization (GRPO) with verifiable rewards.
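To make the fertility argument concrete, the following is a minimal sketch of how the metric can be computed, assuming fertility is defined as the average number of tokens per whitespace-separated word. The two vocabularies and the greedy longest-match tokenizer are hypothetical stand-ins for illustration only; they are not the Mistral or Bielik v3 tokenizers.

```python
from typing import Callable, List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Toy tokenizer (hypothetical): greedy longest match per word,
    falling back to single characters when nothing in vocab matches."""
    tokens: List[str] = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first; a single character always succeeds.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


def fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


# Illustrative vocabularies (assumptions, not real model vocabularies).
GENERIC_VOCAB = {"do", "br"}         # few Polish-relevant pieces
POLISH_VOCAB = {"dzień", "dobry"}    # whole Polish words are covered

texts = ["dzień dobry"]
generic_f = fertility(lambda t: greedy_tokenize(t, GENERIC_VOCAB), texts)
polish_f = fertility(lambda t: greedy_tokenize(t, POLISH_VOCAB), texts)
```

In this toy setup the generic vocabulary splits the two words into eight pieces (fertility 4.0), while the Polish-aware vocabulary needs two (fertility 1.0). Lower fertility means the same text consumes fewer positions in the context window and fewer decoding steps at inference time.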
