Bielik v3 Small: Technical Report

Abstract

We introduce Bielik v3, a series of parameter-efficient generative text models (1.5B and 4.5B) optimized for Polish language processing. These models demonstrate that smaller, well-optimized architectures can achieve performance comparable to much larger counterparts while requiring substantially fewer computational resources. Our approach incorporates several key innovations: a custom Polish tokenizer (APT4) that significantly improves token efficiency, Weighted Instruction Cross-Entropy Loss to balance learning across instruction types, and Adaptive Learning Rate that dynamically adjusts based on training progress. Trained on a meticulously curated corpus of 292 billion tokens spanning 303 million documents, these models excel across multiple benchmarks, including the Open PL LLM Leaderboard, Complex Polish Text Understanding Benchmark, Polish EQ-Bench, and Polish Medical Leaderboard. The 4.5B parameter model achieves results competitive with models 2-3 times its size, while the 1.5B model delivers strong performance despite its extremely compact profile. These advances establish new benchmarks for parameter-efficient language modeling in less-represented languages, making high-quality Polish language AI more accessible for resource-constrained applications.
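The abstract names a Weighted Instruction Cross-Entropy Loss but does not give its formula. As a rough illustration only, the sketch below assumes one plausible form: each instruction example carries a scalar weight that scales its token-level cross-entropy, and the sum is normalized by the total weight. The function names, the batch layout, and the normalization are assumptions made for this example, not the authors' implementation.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy for one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def weighted_instruction_loss(batch):
    """Weight-scaled mean token loss over a batch (hypothetical form).

    batch: list of (token_logits, targets, weight) tuples, one per
    instruction example. `weight` is an assumed per-example scalar,
    e.g. reflecting the example's instruction type.
    """
    total, norm = 0.0, 0.0
    for token_logits, targets, weight in batch:
        for logits, target in zip(token_logits, targets):
            total += weight * token_cross_entropy(logits, target)
            norm += weight
    # Guard against a batch whose weights are all zero.
    return total / norm if norm > 0 else 0.0

# Two single-token examples: a zero weight removes the second one entirely.
example_a = ([[0.0, 0.0]], [0], 1.0)
example_b = ([[math.log(3.0), 0.0]], [0], 0.0)
loss = weighted_instruction_loss([example_a, example_b])
```

With the weights above, only the first example contributes, so the loss equals its own cross-entropy, log 2. Setting both weights to 1.0 would give the plain mean over tokens.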
