SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use

Abstract

Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and to generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark that simulates real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. The benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. To advance research on building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.


