Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Abstract

We introduce Llama Guard, an LLM-based input-output safeguard model geared towards human-AI conversation use cases. Our model incorporates a safety risk taxonomy, a tool for categorizing a specific set of safety risks found in LLM prompts (i.e., prompt classification). The same taxonomy is used to classify the responses that LLMs generate to these prompts, a process we refer to as response classification. For both prompt and response classification, we have meticulously gathered a high-quality dataset. Llama Guard is a Llama2-7b model instruction-tuned on this dataset. Although the dataset is low in volume, the model demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat, where it matches or exceeds currently available content moderation tools. Llama Guard functions as a language model, carrying out multi-class classification and generating binary decision scores. Furthermore, its instruction fine-tuning allows tasks to be customized and output formats to be adapted. This enables, for example, adjusting taxonomy categories to align with specific use cases, and zero-shot or few-shot prompting with diverse taxonomies at the input. We are making the Llama Guard model weights available, and we encourage researchers to further develop and adapt them to meet the community's evolving needs for AI safety.
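The idea of supplying a taxonomy at the input and reading back both a binary decision and a set of violated categories can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the category codes and names, the prompt layout, the expected output format, and the helper names `build_prompt`, `parse_verdict`, and `unsafe_score` are hypothetical and are not taken from the released model or its prompt template. No model is called; the sketch only shows how such a prompt could be assembled and how a generated answer could be interpreted.

```python
import math
from dataclasses import dataclass

# Hypothetical taxonomy: codes and names are illustrative, not the paper's categories.
TAXONOMY = {
    "C1": "Violence",
    "C2": "Illegal activity",
    "C3": "Self-harm",
}


def build_prompt(taxonomy, conversation, target="prompt"):
    """Embed the taxonomy in the instruction so categories can be swapped per use case.

    `target` selects prompt classification (User turn) or response
    classification (Agent turn). The layout is an assumed one.
    """
    role = "User" if target == "prompt" else "Agent"
    categories = "\n".join(f"{code}: {name}" for code, name in taxonomy.items())
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in conversation)
    return (
        f"Task: Check whether the last {role} message is unsafe under these categories.\n"
        f"Categories:\n{categories}\n"
        f"Conversation:\n{turns}\n"
        "Answer 'safe', or 'unsafe' followed by a line listing violated category codes."
    )


@dataclass
class Verdict:
    unsafe: bool       # binary decision
    categories: list   # multi-class part: violated category codes


def parse_verdict(generated, taxonomy):
    """Parse an assumed two-line output: 'safe', or 'unsafe' plus a code list."""
    lines = [ln.strip() for ln in generated.strip().splitlines() if ln.strip()]
    if not lines or lines[0].lower() not in ("safe", "unsafe"):
        raise ValueError(f"unexpected output: {generated!r}")
    if lines[0].lower() == "safe":
        return Verdict(False, [])
    codes = [c.strip() for c in lines[1].split(",")] if len(lines) > 1 else []
    # Keep only codes that exist in the supplied taxonomy.
    return Verdict(True, [c for c in codes if c in taxonomy])


def unsafe_score(logit_safe, logit_unsafe):
    """One plausible binary score: probability of 'unsafe' over the two first-token options.

    Renormalizing over just these two logits is an assumption of this sketch.
    """
    return 1.0 / (1.0 + math.exp(logit_safe - logit_unsafe))
```

Under these assumptions, adapting the classifier to a different policy means passing a different `taxonomy` dictionary to `build_prompt`, and switching between prompt and response classification means changing `target`.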
