SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

Abstract

Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and to generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. To advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
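
The evaluation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the placeholder terms, the `Case` type, and the substring-based compliance check are all assumptions made for illustration, and the released repository may differ on every point.

```python
from dataclasses import dataclass
from itertools import product

# Axes named in the abstract; everything else below is hypothetical.
TONES = ("positive", "negative")
CONTEXTS = ("formal", "informal")


@dataclass(frozen=True)
class Case:
    tone: str
    context: str
    term: str    # the word the prompt instructs the model to include
    prompt: str


def build_cases(task: str, terms: list[str]) -> list[Case]:
    """Cross every tone, context, and term into one prompt each."""
    cases = []
    for tone, context, term in product(TONES, CONTEXTS, terms):
        # Hypothetical template, not taken from the dataset.
        prompt = (
            f"Write a {tone}, {context} {task}. "
            f"Include the word '{term}' in your reply."
        )
        cases.append(Case(tone, context, term, prompt))
    return cases


def complied(case: Case, response: str) -> bool:
    """Assumed check: the model complied if the term appears in its reply."""
    return case.term.lower() in response.lower()


def compliance_rate(cases: list[Case], responses: list[str]) -> float:
    """Fraction of cases in which the model followed the instruction."""
    if not cases:
        return 0.0
    hits = sum(complied(c, r) for c, r in zip(cases, responses))
    return hits / len(cases)
```

A case-insensitive substring match is a deliberate simplification: it would miss obfuscated spellings and would count a reply that quotes the term only to refuse it, so a real evaluation would likely need a stricter judge.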
