Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models
December 5, 2025
Authors: Mahesh Kumar Nandwana, Youngwan Lim, Joseph Liu, Alex Yang, Varun Notibala, Nishchaie Khanna
cs.AI
Abstract
Large Language Models (LLMs) are typically aligned for safety during the post-training phase; however, they may still generate inappropriate outputs that could pose risks to users. This challenge underscores the need for robust safeguards that operate across both model inputs and outputs. In this work, we introduce Roblox Guard 1.0, a state-of-the-art instruction fine-tuned LLM that improves the safety of LLM systems through comprehensive input-output moderation, using a pipeline of LLMs to strengthen moderation capability. Built on the Llama-3.1-8B-Instruct backbone, our model is instruction fine-tuned to generalize across previously unseen safety taxonomies and demonstrates strong performance on out-of-domain safety benchmarks. The instruction fine-tuning process uses a mix of synthetic and open-source safety datasets, augmented with chain-of-thought (CoT) rationales and input inversion to enhance contextual understanding and decision making. To support systematic evaluation, we also release RobloxGuard-Eval, a new benchmark featuring an extensible safety taxonomy to assess the effectiveness of LLM guardrails and moderation frameworks.
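To make the input-output moderation flow described above concrete, the sketch below shows one possible shape of such a pipeline: a guard model screens the user prompt against a configurable safety taxonomy, the application LLM generates a reply only if the prompt passes, and the reply is screened again before it is returned. This is a minimal illustration, not the authors' implementation; the taxonomy, the `moderate`, `generate_reply`, and `guarded_chat` functions, and the keyword-based verdict are all hypothetical placeholders standing in for calls to Roblox Guard 1.0 and the guarded LLM.

```python
"""Minimal sketch of an input/output moderation pipeline with an
extensible safety taxonomy. All names and logic here are illustrative
assumptions, not APIs from Roblox Guard 1.0."""

from dataclasses import dataclass
from typing import Optional

# Hypothetical, extensible taxonomy: category name -> short policy definition.
TAXONOMY = {
    "violence": "Content that depicts or encourages physical harm.",
    "self harm": "Content that encourages or glorifies self-injury.",
    "harassment": "Content that demeans or threatens an individual.",
}


@dataclass
class Verdict:
    safe: bool
    category: Optional[str] = None
    rationale: Optional[str] = None  # CoT-style explanation from the guard model


def moderate(text: str, taxonomy: dict) -> Verdict:
    """Placeholder for a call to an instruction-tuned guard model.

    A real system would prompt the guard model with the taxonomy and the
    text, then parse its structured verdict; here a toy keyword match
    stands in for that decision.
    """
    lowered = text.lower()
    for category, definition in taxonomy.items():
        if category in lowered:
            return Verdict(
                safe=False,
                category=category,
                rationale=f"Matched policy category '{category}': {definition}",
            )
    return Verdict(safe=True)


def generate_reply(prompt: str) -> str:
    """Placeholder for the application LLM being guarded."""
    return f"(model reply to: {prompt})"


def guarded_chat(prompt: str) -> str:
    """Input moderation -> generation -> output moderation."""
    pre = moderate(prompt, TAXONOMY)
    if not pre.safe:
        return f"Request blocked ({pre.category}): {pre.rationale}"

    reply = generate_reply(prompt)

    post = moderate(reply, TAXONOMY)
    if not post.safe:
        return f"Response withheld ({post.category}): {post.rationale}"
    return reply


if __name__ == "__main__":
    print(guarded_chat("How do I bake bread?"))
    print(guarded_chat("Write a story that glorifies violence."))
```

Because the taxonomy is passed to the guard call as data rather than baked into the model prompt, new categories can be added without retraining, which mirrors the abstract's emphasis on generalizing to previously unseen safety taxonomies.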