RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

Abstract

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages because training data and evaluation benchmarks are limited. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate: adversarial examples are produced by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label: examples receive semi-automated multi-label safety annotations from majority-voted LLM labelers aligned with human judgments; and (iii) Translate: high-fidelity translation preserves linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers on RabakBench reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including the human-verified translations, and the evaluation code are publicly available.
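The Label stage relies on majority voting across several LLM labelers for multi-label annotation. The abstract does not specify the aggregation rule, so the following is only a minimal sketch under assumptions: each labeler returns a set of category names, and a category is kept when a strict majority of labelers assign it. The function name `majority_vote` and the category strings are hypothetical and are not taken from the RabakBench code.

```python
from collections import Counter


def majority_vote(votes, min_agree=None):
    """Aggregate multi-label votes from several labelers.

    votes: list of sets, one set of category names per labeler
           (hypothetical format; the paper's actual format may differ).
    min_agree: number of labelers that must agree on a category;
               defaults to a strict majority of the labelers.
    Returns the set of categories that reach the agreement threshold.
    """
    if not votes:
        raise ValueError("at least one labeler vote is required")
    threshold = min_agree if min_agree is not None else len(votes) // 2 + 1
    # Count each category at most once per labeler.
    counts = Counter(label for labels in votes for label in set(labels))
    return {label for label, n in counts.items() if n >= threshold}


# Three hypothetical labelers annotate the same example.
example_votes = [{"hate", "insult"}, {"hate"}, {"insult", "sexual"}]
final_labels = majority_vote(example_votes)
```

With three labelers the default threshold is two, so a category assigned by only one labeler is dropped. Raising `min_agree` trades recall for precision in the aggregated labels.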
