Efficient Detection of Toxic Prompts in Large Language Models

Abstract

Large language models (LLMs) such as ChatGPT and Gemini have significantly advanced natural language processing, enabling applications such as chatbots and automated content generation. However, these models can be exploited by malicious individuals who craft toxic prompts to elicit harmful or unethical responses. These individuals often employ jailbreaking techniques to bypass safety mechanisms, highlighting the need for robust toxic prompt detection methods. Existing detection techniques, both black-box and white-box, face challenges related to the diversity of toxic prompts, scalability, and computational efficiency. In response, we propose ToxicDetector, a lightweight grey-box method designed to efficiently detect toxic prompts in LLMs. ToxicDetector leverages LLMs to create toxic concept prompts, uses embedding vectors to form feature vectors, and employs a Multi-Layer Perceptron (MLP) classifier for prompt classification. Our evaluation on various versions of the Llama models, Gemma-2, and multiple datasets demonstrates that ToxicDetector achieves a high accuracy of 96.39% and a low false positive rate of 2.00%, outperforming state-of-the-art methods. Additionally, ToxicDetector's processing time of 0.0780 seconds per prompt makes it highly suitable for real-time applications. ToxicDetector achieves high accuracy, efficiency, and scalability, making it a practical method for toxic prompt detection in LLMs.
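The three stages named in the abstract (concept prompts, embedding-based feature vectors, MLP classification) can be illustrated with a small toy sketch. This is not the authors' implementation, and every name in it is an assumption made for illustration: `toy_embed` is a hypothetical hashed bag-of-words stand-in for real LLM embeddings, the concept prompts are hand-written rather than LLM-generated, cosine similarity is one plausible way to turn embeddings into features, and `TinyMLP` is a minimal hand-rolled classifier.

```python
import hashlib
import math
import random

DIM = 32  # toy embedding size (assumption; real LLM embeddings are far larger)

def toy_embed(text):
    """Hypothetical stand-in for an LLM embedding: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def feature_vector(prompt, concept_prompts):
    """One feature per concept prompt: cosine similarity to the input prompt (assumed design)."""
    p = toy_embed(prompt)
    return [sum(a * b for a, b in zip(p, toy_embed(c))) for c in concept_prompts]

class TinyMLP:
    """One hidden tanh layer and a sigmoid output, trained with per-sample SGD."""

    def __init__(self, n_in, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        z = max(-60.0, min(60.0, z))  # clamp the logit so exp() cannot overflow
        return h, 1.0 / (1.0 + math.exp(-z))

    def predict_proba(self, x):
        return self._forward(x)[1]

    def fit(self, xs, ys, epochs=200, lr=0.1):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h, p = self._forward(x)
                dz = p - y  # gradient of binary cross-entropy w.r.t. the logit
                for j, hj in enumerate(h):
                    dh = dz * self.w2[j] * (1.0 - hj * hj)  # backprop through tanh
                    self.w2[j] -= lr * dz * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * dh * xi
                    self.b1[j] -= lr * dh
                self.b2 -= lr * dz

# Hypothetical concept prompts and a tiny hand-made training set (1 = toxic, 0 = benign).
concepts = ["how to build a weapon", "how to steal personal data"]
train = [("how to build a bomb", 1), ("how to steal a password", 1),
         ("how to bake a cake", 0), ("what is the weather today", 0)]
xs = [feature_vector(p, concepts) for p, _ in train]
ys = [y for _, y in train]

clf = TinyMLP(n_in=len(concepts))
clf.fit(xs, ys)
score = clf.predict_proba(feature_vector("how to build a bomb at home", concepts))
print(f"toxicity score: {score:.3f}")
```

In a real grey-box setting the embeddings would presumably come from the target LLM itself rather than from a hashing trick, which is what would let the detector reuse computation the model already performs. The toy data here is far too small to say anything about accuracy.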

