Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
February 8, 2026
Authors: Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański
cs.AI
Abstract
As Large Language Models (LLMs) are increasingly deployed in Polish-language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish-language safety classifiers comprising two variants: a 0.1B-parameter model based on MMLW-RoBERTa-base and a 0.5B-parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models perform strongly on multiple benchmarks. The 0.5B variant offers the best overall discrimination, with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant is exceptionally efficient. Notably, Bielik Guard 0.1B v1.1 achieves high precision (77.65%) and a very low false-positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) at the same model size. The models are publicly available and are designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories such as self-harm.
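The abstract reports both micro- and macro-averaged F1, which can diverge when the five safety categories are imbalanced: micro F1 pools true/false positives and false negatives across all labels, while macro F1 averages per-label F1 scores so rare categories weigh equally. The sketch below illustrates this distinction for multi-label classification. All names, category identifiers, and data here are illustrative assumptions, not the paper's actual evaluation code or dataset.

```python
from typing import List, Tuple

# Hypothetical label order mirroring the paper's five categories.
CATEGORIES = ["hate_aggression", "vulgarities", "sexual_content", "crime", "self_harm"]

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 by convention when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true: List[List[int]],
                   y_pred: List[List[int]]) -> Tuple[float, float]:
    """Micro and macro F1 for binary multi-label predictions."""
    n_labels = len(y_true[0])
    tps, fps, fns = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tps[j] += 1          # correct positive for label j
            elif p:
                fps[j] += 1          # predicted but not annotated
            elif t:
                fns[j] += 1          # annotated but missed
    macro = sum(f1_from_counts(tps[j], fps[j], fns[j])
                for j in range(n_labels)) / n_labels
    micro = f1_from_counts(sum(tps), sum(fps), sum(fns))
    return micro, macro

# Toy annotations (made up for illustration), one row per text,
# one column per category in CATEGORIES order.
Y_TRUE = [[1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0],
          [0,0,0,1,0], [0,0,0,0,1], [1,1,0,0,0]]
Y_PRED = [[1,0,0,0,0], [0,1,0,0,0], [0,0,0,0,0],
          [0,0,0,1,1], [0,0,0,0,1], [1,0,0,0,0]]

micro, macro = micro_macro_f1(Y_TRUE, Y_PRED)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
# → micro F1 = 0.769, macro F1 = 0.667
```

In this toy example the completely missed `sexual_content` label drags macro F1 well below micro F1, which is why reporting both (as the paper does: 0.791 micro vs 0.785 macro) gives a fuller picture of per-category behavior.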