

RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages

July 8, 2025
作者: Gabriel Chua, Leanne Tan, Ziyu Ge, Roy Ka-Wei Lee
cs.AI

Abstract

Large language models (LLMs) and their safety classifiers often perform poorly on low-resource languages due to limited training data and evaluation benchmarks. This paper introduces RabakBench, a new multilingual safety benchmark localized to Singapore's unique linguistic context, covering Singlish, Chinese, Malay, and Tamil. RabakBench is constructed through a scalable three-stage pipeline: (i) Generate - adversarial example generation by augmenting real Singlish web content with LLM-driven red teaming; (ii) Label - semi-automated multi-label safety annotation using majority-voted LLM labelers aligned with human judgments; and (iii) Translate - high-fidelity translation preserving linguistic nuance and toxicity across languages. The final dataset comprises over 5,000 safety-labeled examples across four languages and six fine-grained safety categories with severity levels. Evaluations of 11 popular open-source and closed-source guardrail classifiers reveal significant performance degradation. RabakBench not only enables robust safety evaluation in Southeast Asian multilingual settings but also offers a reproducible framework for building localized safety datasets in low-resource environments. The benchmark dataset, including human-verified translations, and the evaluation code are publicly available.
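The Label stage aggregates multi-label safety annotations from several LLM labelers by majority vote. A minimal sketch of that aggregation step is below; the category names and the strict-majority threshold are assumptions for illustration, as the abstract does not enumerate the six fine-grained categories or specify the voting rule.

```python
from collections import Counter

# Hypothetical category names: the paper's six fine-grained safety
# categories are not listed in this abstract.
CATEGORIES = ["hateful", "insults", "sexual", "violence", "self_harm", "misconduct"]

def majority_vote(labeler_outputs):
    """Aggregate multi-label annotations from several LLM labelers.

    labeler_outputs: one set of flagged categories per labeler, e.g.
        [{"insults"}, {"insults", "hateful"}, {"insults"}]
    A category is kept when a strict majority of labelers flag it,
    mirroring (at a high level) the majority-voted labeling the
    abstract describes.
    """
    n = len(labeler_outputs)
    counts = Counter(label for labels in labeler_outputs for label in labels)
    return {c for c in CATEGORIES if counts[c] > n / 2}

votes = [{"insults"}, {"insults", "hateful"}, {"insults"}]
print(sorted(majority_vote(votes)))  # ['insults']
```

In this sketch, "hateful" is dropped because only one of three labelers flagged it; in the actual pipeline, disputed cases like this would be the ones aligned against human judgments.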
PDF · July 11, 2025