Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Abstract

Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
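The multi-turn evaluation described above can be illustrated with a minimal sketch. This is not the DuET-PD implementation: the item layout (`question`, `correct`, `appeals`), the `answer_fn` callable, and the toy `gullible_stub` model are all hypothetical names introduced here, and the metric shown (accuracy recorded before persuasion and after each persuasive appeal) is only one plausible reading of "accuracy under sustained persuasion".

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: takes the dialogue history, returns an option label.
AnswerFn = Callable[[List[str]], str]


def accuracy_by_turn(items: Sequence[Dict], answer_fn: AnswerFn) -> List[float]:
    """Accuracy at turn 0 (no persuasion) and after each successive appeal."""
    if not items:
        return []
    # Use the shortest appeal list so every item contributes to every turn.
    n_turns = 1 + min(len(it["appeals"]) for it in items)
    correct = [0] * n_turns
    for it in items:
        history = [it["question"]]
        for turn in range(n_turns):
            if turn > 0:
                # Add the next persuasive message before re-asking.
                history.append(it["appeals"][turn - 1])
            choice = answer_fn(history)
            history.append(choice)
            if choice == it["correct"]:
                correct[turn] += 1
    return [c / len(items) for c in correct]


def gullible_stub(history: List[str]) -> str:
    """Toy stand-in for a model: switches to option B after two appeals."""
    appeals_seen = sum(1 for msg in history if msg.startswith("APPEAL"))
    return "B" if appeals_seen >= 2 else "A"


items = [
    {"question": "Q1", "correct": "A", "appeals": ["APPEAL 1", "APPEAL 2", "APPEAL 3"]},
    {"question": "Q2", "correct": "A", "appeals": ["APPEAL 1", "APPEAL 2", "APPEAL 3"]},
]
print(accuracy_by_turn(items, gullible_stub))  # [1.0, 1.0, 0.0, 0.0]
```

In this toy run the stub starts out correct and capitulates at the second misleading appeal, so accuracy drops from 1.0 to 0.0. A corrective-persuasion run would have the same shape, with appeals that argue for the correct option and items the model initially answers incorrectly.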