PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

June 8, 2026
Authors: Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen, Felix Mächtle, Stine Lyngsø Beltoft, Peter Schneider-Kamp, Thomas Eisenbarth, Lukas Galke Poech, Anne Lauscher
cs.AI

Abstract

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.