Adapting Safe-for-Work Classifier for Malaysian Language Text: Enhancing Alignment in LLM-Ops Framework

Abstract



As large language models (LLMs) become increasingly integrated into operational workflows (LLM-Ops), there is a pressing need for effective guardrails that ensure safe and aligned interactions, including the ability to detect potentially unsafe or inappropriate content across languages. However, existing safe-for-work classifiers focus primarily on English text. To address this gap for the Malaysian language, we present a novel safe-for-work text classifier tailored specifically to Malaysian-language content. By curating and annotating a first-of-its-kind dataset of Malaysian text spanning multiple content categories, we trained, using state-of-the-art natural language processing techniques, a classification model capable of identifying potentially unsafe material. This work is an important step toward safer interactions and content filtering that mitigate potential risks and support the responsible deployment of LLMs. To maximize accessibility and promote further research on alignment in LLM-Ops for the Malaysian context, the model is publicly released at https://huggingface.co/malaysia-ai/malaysian-sfw-classifier.
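The abstract positions the classifier as a guardrail inside an LLM-Ops pipeline. The following is a minimal sketch of that role, assuming the classifier is exposed as a callable that returns a probability per content category. The label names (`safe`, `harassment`), the 0.5 threshold, and the helpers `sfw_guardrail`, `guarded_generate`, and `toy_classifier` are all hypothetical and do not reflect the released model's actual interface or label set. A toy keyword stub stands in for the model so the sketch runs without downloading anything.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical interface: maps a text to a probability per content category.
# In practice this would wrap the released classifier; the label names used
# here are illustrative assumptions, not the model's actual labels.
Classifier = Callable[[str], Dict[str, float]]


@dataclass
class GuardrailResult:
    allowed: bool
    label: str
    score: float


def sfw_guardrail(text: str, classify: Classifier,
                  safe_label: str = "safe", threshold: float = 0.5) -> GuardrailResult:
    """Allow `text` only if no unsafe category reaches `threshold`."""
    scores = classify(text)
    # Treat every category other than the safe one as potentially unsafe.
    unsafe = {k: v for k, v in scores.items() if k != safe_label}
    if not unsafe:
        return GuardrailResult(True, safe_label, scores.get(safe_label, 1.0))
    label, score = max(unsafe.items(), key=lambda kv: kv[1])
    if score >= threshold:
        return GuardrailResult(False, label, score)
    return GuardrailResult(True, safe_label, scores.get(safe_label, 1.0 - score))


def guarded_generate(prompt: str, generate: Callable[[str], str],
                     classify: Classifier, refusal: str = "[blocked]") -> str:
    """Screen both the incoming prompt and the LLM's output."""
    if not sfw_guardrail(prompt, classify).allowed:
        return refusal
    output = generate(prompt)
    return output if sfw_guardrail(output, classify).allowed else refusal


def toy_classifier(text: str) -> Dict[str, float]:
    # Hypothetical stand-in for the real model: flags a placeholder marker.
    if "<unsafe>" in text.lower():
        return {"safe": 0.1, "harassment": 0.9}
    return {"safe": 0.95, "harassment": 0.05}


if __name__ == "__main__":
    print(sfw_guardrail("Selamat pagi", toy_classifier))
    print(guarded_generate("hello", lambda p: "<unsafe>", toy_classifier))
```

Screening both the prompt and the generated output reflects the two places a guardrail typically sits in an LLM-Ops workflow; a deployment could instead log flagged items or apply per-category thresholds.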
