

Efficient Detection of Toxic Prompts in Large Language Models

August 21, 2024
作者: Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
cs.AI

Abstract

Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both black-box and white-box, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight grey-box method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the Llama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector's processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.
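The pipeline the abstract describes (toxic-concept prompts, embedding-based feature vectors, and an MLP classifier) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `embed` function is a deterministic stand-in for the LLM's hidden-state embeddings, the concept prompts and the synthetic training data are placeholders, and the exact feature construction used by ToxicDetector may differ.

```python
import hashlib
import numpy as np
from sklearn.neural_network import MLPClassifier

EMBED_DIM = 32
rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Stand-in for an LLM embedding: a deterministic pseudo-random
    vector seeded from the text (illustration only)."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

# Step 1: toxic-concept prompts (the paper generates these with an LLM;
# these strings are placeholders).
concepts = ["toxic concept 1", "toxic concept 2"]
concept_vecs = np.stack([embed(c) for c in concepts])

# Step 2: feature vector = similarity of a prompt embedding to each
# toxic-concept embedding (dot product here, as one plausible choice).
def features(vec: np.ndarray) -> np.ndarray:
    return concept_vecs @ vec

# Synthetic training data: "toxic" embeddings lie near a concept vector,
# "benign" embeddings are unrelated noise -- purely to exercise the pipeline.
def sample(toxic: bool) -> np.ndarray:
    base = concept_vecs[rng.integers(len(concepts))] if toxic \
        else rng.normal(size=EMBED_DIM)
    return base + 0.3 * rng.normal(size=EMBED_DIM)

X = np.stack([features(sample(toxic=(i % 2 == 0))) for i in range(400)])
y = np.array([i % 2 == 0 for i in range(400)], dtype=int)

# Step 3: a small MLP classifies feature vectors as toxic vs. benign.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X[:300], y[:300])
accuracy = clf.score(X[300:], y[300:])
```

Because classification happens over a small, fixed-size feature vector rather than full text, the per-prompt cost after embedding is tiny, which is consistent with the sub-0.1-second latency the abstract reports.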

