SweEval：大型語言模型真的會說髒話嗎？企業應用安全測試的極限基準

摘要

企業客戶日益採用大型語言模型（LLMs）來處理關鍵的溝通任務，例如撰寫電子郵件、構思銷售提案以及編寫非正式訊息。要在不同地區部署此類模型，必須使其理解多元的文化與語言背景，並生成安全且得體的回應。對於企業應用而言，有效識別並處理不安全或冒犯性語言，以減輕聲譽風險、維護信任並確保合規性，至關重要。為此，我們推出了SweEval，這是一個模擬現實情境的基準測試，涵蓋語氣（正面或負面）與語境（正式或非正式）的變化。提示語明確指示模型在完成任務時包含特定的粗俗詞彙。此基準測試旨在評估LLMs是否遵循或抵制此類不當指令，並檢驗其與倫理框架、文化細微差異及語言理解能力的契合度。為推動構建符合倫理的人工智慧系統的研究，無論是企業應用還是更廣泛的領域，我們公開了數據集與程式碼：https://github.com/amitbcp/multilingual_profanity。

English

Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.

SweEval：大型語言模型真的會說髒話嗎？企業應用安全測試的極限基準

SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

摘要

Support