Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
December 7, 2023
Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa
cs.AI
Abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared
towards Human-AI conversation use cases. Our model incorporates a safety risk
taxonomy, a valuable tool for categorizing a specific set of safety risks found
in LLM prompts (i.e., prompt classification). This taxonomy is also
instrumental in classifying the responses generated by LLMs to these prompts, a
process we refer to as response classification. For both prompt and response
classification, we have meticulously gathered a small but high-quality dataset.
Llama Guard, a Llama2-7b model instruction-tuned on this dataset,
demonstrates strong performance on
existing benchmarks such as the OpenAI Moderation Evaluation dataset and
ToxicChat, where its performance matches or exceeds that of currently available
content moderation tools. Llama Guard functions as a language model, carrying
out multi-class classification and generating binary decision scores.
Furthermore, the instruction fine-tuning of Llama Guard allows for the
customization of tasks and the adaptation of output formats. This feature
enhances the model's capabilities, such as enabling the adjustment of taxonomy
categories to align with specific use cases, and facilitating zero-shot or
few-shot prompting with diverse taxonomies at the input. We are making Llama
Guard model weights available, and we encourage researchers to further develop
and adapt them to meet the evolving needs of the AI safety community.
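The abstract describes a usage pattern in which a caller assembles an instruction containing a (customizable) safety taxonomy plus the conversation, then reads back a binary safe/unsafe verdict with violated categories. The sketch below illustrates that interaction shape only; the prompt template, category IDs, and helper names here are illustrative assumptions, not the canonical format from the released model card.

```python
# Illustrative sketch of the Llama Guard interaction pattern described in the
# abstract. The template text, category IDs, and function names below are
# assumptions for demonstration; consult the released model card for the
# canonical prompt format and full taxonomy.

SAFETY_CATEGORIES = {
    "O1": "Violence and Hate",
    "O2": "Sexual Content",
    "O3": "Criminal Planning",
}  # truncated example taxonomy; the paper defines additional categories

def build_prompt(role: str, conversation: str,
                 categories: dict = SAFETY_CATEGORIES) -> str:
    """Assemble a classification instruction for one conversation.

    `role` selects the task: "User" for prompt classification,
    "Agent" for response classification (the two tasks in the abstract).
    Passing a different `categories` dict mirrors the taxonomy
    customization the instruction-tuned model supports.
    """
    taxonomy = "\n".join(f"{cid}: {name}" for cid, name in categories.items())
    return (
        f"Task: Check if there is unsafe content in '{role}' messages in the "
        "conversation below according to our safety policy.\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{taxonomy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        f"<BEGIN CONVERSATION>\n{conversation}\n<END CONVERSATION>\n\n"
        "Provide your safety assessment. First line: 'safe' or 'unsafe'. "
        "Second line (only if unsafe): comma-separated violated categories."
    )

def parse_verdict(generation: str) -> tuple[bool, list[str]]:
    """Reduce the model's text output to a binary decision plus categories."""
    lines = [ln.strip() for ln in generation.strip().splitlines() if ln.strip()]
    unsafe = bool(lines) and lines[0].lower() == "unsafe"
    categories = lines[1].split(",") if unsafe and len(lines) > 1 else []
    return unsafe, [c.strip() for c in categories]
```

In this framing, the generated first token ("safe"/"unsafe") carries the binary decision the abstract mentions, and the second line carries the multi-class category labels; a production caller would feed `build_prompt(...)` to the model and pass the generation to `parse_verdict`.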