Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
December 7, 2023
Authors: Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, Madian Khabsa
cs.AI
Abstract
We introduce Llama Guard, an LLM-based input-output safeguard model geared
towards Human-AI conversation use cases. Our model incorporates a safety risk
taxonomy, a valuable tool for categorizing a specific set of safety risks found
in LLM prompts (i.e., prompt classification). This taxonomy is also
instrumental in classifying the responses generated by LLMs to these prompts, a
process we refer to as response classification. For both prompt and response
classification, we have meticulously gathered a small but high-quality dataset.
Llama Guard, a Llama2-7b model instruction-tuned on this dataset,
demonstrates strong performance on
existing benchmarks such as the OpenAI Moderation Evaluation dataset and
ToxicChat, where its performance matches or exceeds that of currently available
content moderation tools. Llama Guard functions as a language model, carrying
out multi-class classification and generating binary decision scores.
Furthermore, the instruction fine-tuning of Llama Guard allows for the
customization of tasks and the adaptation of output formats. This feature
enhances the model's capabilities, such as enabling the adjustment of taxonomy
categories to align with specific use cases, and facilitating zero-shot or
few-shot prompting with diverse taxonomies at the input. We are making Llama
Guard model weights available, and we encourage researchers to further develop
and adapt them to meet the evolving needs of the AI safety community.
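The abstract describes a usage pattern in which a caller assembles an instruction containing a (customizable) safety taxonomy plus the conversation, then reads back a binary safe/unsafe verdict with violated categories. The sketch below illustrates that interaction shape only; the prompt template, category IDs, and helper names here are illustrative assumptions, not the canonical format from the released model card.

```python
# Illustrative sketch of the Llama Guard interaction pattern described in the
# abstract. The template text, category IDs, and function names below are
# assumptions for demonstration; consult the released model card for the
# canonical prompt format and full taxonomy.

SAFETY_CATEGORIES = {
    "O1": "Violence and Hate",
    "O2": "Sexual Content",
    "O3": "Criminal Planning",
}  # truncated example taxonomy; the paper defines additional categories

def build_prompt(role: str, conversation: str,
                 categories: dict = SAFETY_CATEGORIES) -> str:
    """Assemble a classification instruction for one conversation.

    `role` selects the task: "User" for prompt classification,
    "Agent" for response classification (the two tasks in the abstract).
    Passing a different `categories` dict mirrors the taxonomy
    customization the instruction-tuned model supports.
    """
    taxonomy = "\n".join(f"{cid}: {name}" for cid, name in categories.items())
    return (
        f"Task: Check if there is unsafe content in '{role}' messages in the "
        "conversation below according to our safety policy.\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{taxonomy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        f"<BEGIN CONVERSATION>\n{conversation}\n<END CONVERSATION>\n\n"
        "Provide your safety assessment. First line: 'safe' or 'unsafe'. "
        "Second line (only if unsafe): comma-separated violated categories."
    )

def parse_verdict(generation: str) -> tuple[bool, list[str]]:
    """Reduce the model's text output to a binary decision plus categories."""
    lines = [ln.strip() for ln in generation.strip().splitlines() if ln.strip()]
    unsafe = bool(lines) and lines[0].lower() == "unsafe"
    categories = lines[1].split(",") if unsafe and len(lines) > 1 else []
    return unsafe, [c.strip() for c in categories]
```

In this framing, the generated first token ("safe"/"unsafe") carries the binary decision the abstract mentions, and the second line carries the multi-class category labels; a production caller would feed `build_prompt(...)` to the model and pass the generation to `parse_verdict`.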