Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD

Abstract

Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
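To make the idea of multi-turn stance-change evaluation concrete, the sketch below computes accuracy at each persuasion turn and the share of dialogues whose correctness flips between the first and last turn. It is an illustration only, not the DuET-PD implementation: the `Dialogue` record layout, the function names `accuracy_by_turn` and `flip_rate`, and the toy data are all assumptions introduced here, as is the choice to apply misleading persuasion to initially correct answers and corrective persuasion to initially incorrect ones.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Dialogue:
    """Hypothetical record of one question tracked across a persuasion dialogue."""
    correct_answer: str
    # answers[0] is the model's initial answer; answers[t] is its answer
    # after the t-th persuasion turn.
    answers: List[str]


def accuracy_by_turn(dialogues: List[Dialogue]) -> List[float]:
    """Fraction of dialogues answered correctly at each turn index."""
    if not dialogues:
        return []
    # Only compare turns that every dialogue actually reached.
    n_turns = min(len(d.answers) for d in dialogues)
    return [
        sum(d.answers[t] == d.correct_answer for d in dialogues) / len(dialogues)
        for t in range(n_turns)
    ]


def flip_rate(dialogues: List[Dialogue], initially_correct: bool) -> float:
    """Share of dialogues whose correctness differs between first and last turn.

    With initially_correct=True this measures how often a correct answer was
    abandoned (an assumed proxy for gullibility); with False it measures how
    often a wrong answer was fixed (an assumed proxy for receptiveness).
    """
    pool = [
        d for d in dialogues
        if (d.answers[0] == d.correct_answer) == initially_correct
    ]
    if not pool:
        return 0.0
    flipped = sum(
        (d.answers[-1] == d.correct_answer) != initially_correct for d in pool
    )
    return flipped / len(pool)


# Toy data (invented for illustration, not results from the paper).
misleading = [
    Dialogue("B", ["B", "B", "A", "A"]),  # gives in at turn 2
    Dialogue("C", ["C", "C", "C", "C"]),  # holds its answer
    Dialogue("A", ["A", "D", "D", "D"]),  # gives in at turn 1
    Dialogue("D", ["D", "D", "D", "B"]),  # gives in at turn 3
]
corrective = [
    Dialogue("A", ["D", "D", "A", "A"]),  # accepts the correction
    Dialogue("D", ["B", "B", "B", "B"]),  # resists the correction
]

print(accuracy_by_turn(misleading))                  # [1.0, 0.75, 0.5, 0.25]
print(flip_rate(misleading, initially_correct=True))   # 0.75
print(flip_rate(corrective, initially_correct=False))  # 0.5
```

Reporting the two flip rates side by side reflects the dual framing in the abstract: a model that never changes its answer would score well against misleading persuasion but poorly on corrective persuasion, so neither number is informative on its own.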
