PsychoSafe: Eliciting Psychologically Grounded Refusals in Large Language Models

Abstract

Large language models (LLMs) routinely face requests that they should refuse, which creates a trade-off between helpfulness and harm prevention. However, a refusal can itself be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8,019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, scored by an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data so that models apply interventions selectively rather than formulaically.