PsychoSafe: Eliciting Psychologically Informed Refusals in Large Language Models

Abstract

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8,019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data so that models apply interventions selectively rather than schematically.