Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models

Abstract

As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Existing benchmarks evaluate various aspects of LLM safety, but they define safety largely by general standards and overlook user-specific ones. However, safety standards for LLMs may vary with a user's profile rather than being universally consistent across all users. This raises a critical research question: do LLM agents act safely when user-specific safety standards are taken into account? Despite the importance of this question for safe LLM use, no benchmark dataset currently exists for evaluating the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when user-specific safety standards are considered, a new finding in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought and demonstrate its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
