Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Abstract

We introduce Llama Guard, an LLM-based input-output safeguard model geared towards Human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). This taxonomy is also instrumental in classifying the responses generated by LLMs to these prompts, a process we refer to as response classification. For the purpose of both prompt and response classification, we have meticulously gathered a dataset of high quality. Llama Guard, a Llama2-7b model that is instruction-tuned on our collected dataset, albeit low in volume, demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where its performance matches or exceeds that of currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, the instruction fine-tuning of Llama Guard allows for the customization of tasks and the adaptation of output formats. This feature enhances the model's capabilities, such as enabling the adjustment of taxonomy categories to align with specific use cases, and facilitating zero-shot or few-shot prompting with diverse taxonomies at the input. We are making Llama Guard model weights available and we encourage researchers to further develop and adapt them to meet the evolving needs of the community for AI safety.
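
As a rough illustration of supplying a taxonomy at the input and reading back a binary decision with category labels, the sketch below assembles a classification prompt from a caller-defined taxonomy and parses a text verdict. The prompt wording, the category codes and descriptions, the assumed "safe" / "unsafe plus codes" reply format, and the helper names `build_prompt` and `parse_verdict` are all hypothetical; the abstract does not specify Llama Guard's actual template or output format, and no model is called here.

```python
from typing import Dict, List, Tuple

# Illustrative taxonomy only: codes and descriptions are assumptions,
# not the categories used by Llama Guard.
TAXONOMY: Dict[str, str] = {
    "C1": "Threats of violence",
    "C2": "Requests for help with illegal activity",
}


def build_prompt(taxonomy: Dict[str, str], user_message: str) -> str:
    """Assemble a zero-shot classification prompt that embeds the taxonomy."""
    lines = ["Task: decide whether the user message is safe under these categories."]
    for code, description in taxonomy.items():
        lines.append(f"{code}: {description}")
    lines.append("")
    lines.append(f"User: {user_message}")
    lines.append("")
    lines.append(
        "Answer 'safe' or 'unsafe'. If unsafe, list the violated "
        "category codes on a second line, comma-separated."
    )
    return "\n".join(lines)


def parse_verdict(output: str, taxonomy: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Parse the assumed reply format into (is_safe, violated_codes)."""
    parts = output.strip().splitlines()
    if not parts:
        raise ValueError("empty model output")
    label = parts[0].strip().lower()
    if label == "safe":
        return True, []
    if label != "unsafe":
        raise ValueError(f"unexpected verdict: {parts[0]!r}")
    codes: List[str] = []
    if len(parts) > 1:
        codes = [c.strip() for c in parts[1].split(",") if c.strip()]
    unknown = [c for c in codes if c not in taxonomy]
    if unknown:
        raise ValueError(f"codes not in taxonomy: {unknown}")
    return False, codes
```

Swapping in a different `TAXONOMY` dictionary changes the categories the prompt asks about without touching the parsing logic, which is the kind of input-time adaptation the abstract describes.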
