PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models
June 8, 2026
Authors: Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen, Felix Mächtle, Stine Lyngsø Beltoft, Peter Schneider-Kamp, Thomas Eisenbarth, Lukas Galke Poech, Anne Lauscher
cs.AI
Abstract
Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.