Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models
December 5, 2025
Authors: Mahesh Kumar Nandwana, Youngwan Lim, Joseph Liu, Alex Yang, Varun Notibala, Nishchaie Khanna
cs.AI
Abstract
Large Language Models (LLMs) are typically aligned for safety during the post-training phase; however, they may still generate inappropriate outputs that pose potential risks to users. This challenge underscores the need for robust safeguards that operate across both model inputs and outputs. In this work, we introduce Roblox Guard 1.0, a state-of-the-art instruction fine-tuned LLM designed to enhance the safety of LLM systems through comprehensive input-output moderation, using a pipeline of LLMs to strengthen moderation capability. Built on the Llama-3.1-8B-Instruct backbone, our model is instruction fine-tuned to generalize across previously unseen safety taxonomies and demonstrates strong performance on out-of-domain safety benchmarks. The instruction fine-tuning process uses a mix of synthetic and open-source safety datasets, augmented with chain-of-thought (CoT) rationales and input inversion to enhance contextual understanding and decision-making. To support systematic evaluation, we also release RobloxGuard-Eval, a new benchmark featuring an extensible safety taxonomy to assess the effectiveness of LLM guardrails and moderation frameworks.
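To make the input-output moderation flow concrete, below is a minimal Python sketch of a guardrail wrapper that screens both the user prompt and the model response against a caller-supplied safety taxonomy embedded in the classification prompt. All identifiers (SAFETY_TAXONOMY, ModerationVerdict, call_guard_model, moderated_generate) and the keyword stub are illustrative assumptions, not the Roblox Guard 1.0 interface or the paper's actual pipeline.

```python
# Hypothetical sketch of taxonomy-adaptive input/output moderation.
# None of these names come from the Roblox Guard 1.0 release.
from dataclasses import dataclass

# Example taxonomy; categories can be swapped without retraining
# because they are passed to the guard model in its prompt.
SAFETY_TAXONOMY = {
    "S1": "Violence and threats",
    "S2": "Harassment and bullying",
    "S3": "Sexual content",
    "S4": "Self-harm",
}

@dataclass
class ModerationVerdict:
    is_safe: bool
    violated_categories: list[str]
    rationale: str  # chain-of-thought style explanation

def build_guard_prompt(text: str, role: str) -> str:
    """Embed the taxonomy in the classification prompt."""
    categories = "\n".join(f"{k}: {v}" for k, v in SAFETY_TAXONOMY.items())
    return (
        f"You are a content-safety classifier. Safety taxonomy:\n{categories}\n\n"
        f"Classify the following {role} as safe or unsafe, list any violated "
        f"category codes, and briefly explain your reasoning.\n\n{role}: {text}"
    )

def call_guard_model(prompt: str) -> ModerationVerdict:
    # Stand-in for a call to a fine-tuned guard model (e.g., a
    # Llama-3.1-8B-Instruct derivative behind an API). A real deployment
    # would parse the model's structured verdict; here a crude keyword
    # match keeps the sketch self-contained and runnable.
    flagged = [code for code, name in SAFETY_TAXONOMY.items()
               if name.split()[0].lower() in prompt.lower()]
    return ModerationVerdict(
        is_safe=not flagged,
        violated_categories=flagged,
        rationale="keyword stub; replace with model-generated rationale",
    )

def moderated_generate(user_prompt: str, generate_fn) -> str:
    """Guard both sides of the LLM: screen the input, then the output."""
    input_verdict = call_guard_model(build_guard_prompt(user_prompt, "user prompt"))
    if not input_verdict.is_safe:
        return "Request declined: " + ", ".join(input_verdict.violated_categories)

    response = generate_fn(user_prompt)

    output_verdict = call_guard_model(build_guard_prompt(response, "model response"))
    if not output_verdict.is_safe:
        return "Response withheld: " + ", ".join(output_verdict.violated_categories)
    return response

if __name__ == "__main__":
    echo = lambda p: f"(model output for: {p})"
    print(moderated_generate("Tell me a bedtime story", echo))
```

Because the taxonomy travels in the prompt rather than being fixed in the classifier, adding new or previously unseen categories only requires editing the taxonomy definition, which is the kind of flexibility the abstract describes as generalizing across unseen safety taxonomies.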