Efficient Detection of Toxic Prompts in Large Language Models

August 21, 2024
Authors: Yi Liu, Junzhe Yu, Huijia Sun, Ling Shi, Gelei Deng, Yuqi Chen, Yang Liu
cs.AI

Abstract

Large language models (LLMs) like ChatGPT and Gemini have significantly advanced natural language processing, enabling various applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both blackbox and whitebox, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight greybox method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the LLama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector's processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.
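The pipeline the abstract describes (toxic-concept prompts → embedding-based feature vectors → MLP classification) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the "embeddings" below are synthetic random vectors, the feature vector is cosine similarity to each concept embedding, and the MLP is a hand-rolled one-hidden-layer network; in the paper, the embeddings come from the internal layers of the target LLM and the concept prompts are generated by an LLM.

```python
# Toy sketch of an embedding-similarity + MLP toxic-prompt classifier.
# All data here is synthetic and illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16          # stand-in embedding dimension
N_CONCEPTS = 4    # stand-in number of toxic-concept prompts

# Stand-in "toxic concept" embeddings (the paper derives these from
# LLM-generated toxic concept prompts).
concepts = rng.normal(size=(N_CONCEPTS, DIM))

def features(emb):
    """Feature vector: cosine similarity to each concept embedding."""
    return concepts @ emb / (np.linalg.norm(concepts, axis=1)
                             * np.linalg.norm(emb))

def make_data(n):
    """Synthetic set: 'toxic' embeddings lie near a concept, 'benign' are random."""
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:  # toxic: near a concept direction
            emb = concepts[rng.integers(N_CONCEPTS)] + 0.3 * rng.normal(size=DIM)
            y.append(1)
        else:                   # benign: unrelated random embedding
            emb = rng.normal(size=DIM)
            y.append(0)
        X.append(features(emb))
    return np.array(X), np.array(y)

class TinyMLP:
    """One-hidden-layer MLP, sigmoid output, trained by gradient descent."""
    def __init__(self, d_in, d_hidden=8, lr=0.5):
        self.W1 = rng.normal(scale=0.5, size=(d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(scale=0.5, size=d_hidden)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)
        return 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))

    def step(self, X, y):
        p = self.forward(X)
        g = (p - y) / len(y)                       # dL/dz for BCE + sigmoid
        gh = np.outer(g, self.W2) * (1 - self.h**2)
        self.W2 -= self.lr * (self.h.T @ g)
        self.b2 -= self.lr * g.sum()
        self.W1 -= self.lr * (X.T @ gh)
        self.b1 -= self.lr * gh.sum(axis=0)

X_train, y_train = make_data(400)
clf = TinyMLP(N_CONCEPTS)
for _ in range(500):
    clf.step(X_train, y_train)

X_test, y_test = make_data(200)
acc = ((clf.forward(X_test) > 0.5) == y_test).mean()
print(f"toy accuracy: {acc:.2f}")
```

Because the similarity features make the two classes nearly separable in this toy setting, even a very small MLP classifies well; the paper's contribution lies in obtaining discriminative features cheaply from the LLM's own embeddings, which is what makes the method fast enough (0.0780 s/prompt) for real-time use.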
