LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models

Abstract

Recent advancements in Large Language Models (LLMs) have enabled them to approach human-level persuasion capabilities. However, this potential also raises concerns about the safety risks of LLM-driven persuasion, particularly the potential for unethical influence through manipulation, deception, exploitation of vulnerabilities, and many other harmful tactics. In this work, we present a systematic investigation of LLM persuasion safety through two critical aspects: (1) whether LLMs appropriately reject unethical persuasion tasks and avoid unethical strategies during execution, including cases where the initial persuasion goal appears ethically neutral, and (2) how influencing factors such as personality traits and external pressures affect their behavior. To this end, we introduce PersuSafety, the first comprehensive framework for assessing persuasion safety, which consists of three stages: persuasion scene creation, persuasive conversation simulation, and persuasion safety assessment. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Through extensive experiments across 8 widely used LLMs, we observe significant safety concerns in most LLMs, including failing to identify harmful persuasion tasks and leveraging various unethical persuasion strategies. Our study calls for more attention to improving safety alignment in progressive and goal-driven conversations such as persuasion.
