Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
December 7, 2023
Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa
cs.AI
Abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For both prompt and response classification, we have meticulously gathered a high-quality dataset. Llama Guard, a Llama2-7b model instruction-tuned on our collected dataset, albeit small in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where it matches or exceeds the performance of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores.
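
As an illustration of this generative-classifier usage, here is a minimal sketch of running Llama Guard for response classification with Hugging Face transformers. The checkpoint name `meta-llama/LlamaGuard-7b` and the reliance on its built-in chat template are assumptions based on the publicly released weights, not details stated in this abstract.

```python
# A minimal sketch, assuming the released Hugging Face checkpoint
# "meta-llama/LlamaGuard-7b", whose chat template wraps a conversation in the
# safety-classification instructions and taxonomy described in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def moderate(chat):
    """Classify the last turn of a conversation; returns the generated verdict."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32)
    # Decode only the newly generated tokens: "safe", or "unsafe" followed by
    # the violated taxonomy categories on a second line.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Response classification: the assistant's reply is assessed in context.
print(moderate([
    {"role": "user", "content": "How do I pick a lock?"},
    {"role": "assistant", "content": "Sorry, I can't help with that."},
]))
```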
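The binary decision scores mentioned above can be recovered from the generative model by reading the next-token distribution at the first decoding step. The sketch below assumes, as with the released checkpoint, that the verdict string begins with the word "safe" or "unsafe"; only the relative probability of those two leading tokens is used.

```python
# A sketch of deriving a binary decision score from Llama Guard's first
# generated token, assuming the verdict starts with "safe" or "unsafe".
import torch

def unsafe_score(model, tokenizer, chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(next_token_logits.float(), dim=-1)
    # First sub-token of each verdict word; the two differ, so comparing them
    # discriminates between the verdicts even under sub-word tokenization.
    safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    # Renormalize over the two admissible first tokens for a score in [0, 1].
    return (probs[unsafe_id] / (probs[unsafe_id] + probs[safe_id])).item()
```

A fixed threshold on this score (e.g., 0.5) then yields the binary safe/unsafe decision.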
Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, for example by enabling the adjustment of taxonomy categories to align with specific use cases and by facilitating zero-shot or few-shot prompting with diverse taxonomies supplied at the input. We are making the Llama Guard model weights available, and we encourage researchers to further develop and adapt them to meet the evolving needs of the AI safety community.
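
To make the zero-shot adaptability concrete, here is a sketch of prompting with a custom taxonomy supplied at the input. The layout (task instruction, category list between markers, conversation, output-format request) follows the general scheme described in the paper, but the category names and exact wording here are illustrative assumptions, not the released prompt verbatim.

```python
# Sketch of zero-shot prompting with a custom, user-supplied taxonomy.
# Category names and exact wording are illustrative assumptions.
CUSTOM_CATEGORIES = """O1: Medical Advice.
Should not give specific diagnoses or prescribe treatments.
O2: Financial Advice.
Should not recommend specific securities or investment strategies."""

def build_prompt(user_message: str) -> str:
    return (
        "Task: Check if there is unsafe content in 'User' messages in the "
        "conversation below according to our safety policy with the below categories.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{CUSTOM_CATEGORIES}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n\n"
        f"User: {user_message}\n\n"
        "<END CONVERSATION>\n\n"
        "Provide your safety assessment for 'User' in the above conversation:\n"
        "- First line must read 'safe' or 'unsafe'.\n"
        "- If unsafe, a second line must include a comma-separated list of violated categories."
    )

# Generate from the plain-text prompt instead of the default chat template:
# ids = tokenizer(build_prompt("Which stocks should I buy?"), return_tensors="pt").input_ids
```

Because the taxonomy travels in the prompt, categories can be added, dropped, or reworded per use case without retraining the model.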