The Art of Saying No: Contextual Noncompliance in Language Models

Abstract

Chat-based language models are designed to be helpful, yet they should not comply with every user request. While most existing work focuses on refusing "unsafe" queries, we posit that the scope of noncompliance should be broadened. We introduce a comprehensive taxonomy of contextual noncompliance describing when and how models should not comply with user requests. Our taxonomy spans a wide range of categories, including incomplete, unsupported, indeterminate, and humanizing requests (in addition to unsafe requests). To test the noncompliance capabilities of language models, we use this taxonomy to develop a new evaluation suite of 1000 noncompliance prompts. We find that most existing models show significantly high compliance rates in certain previously understudied categories, with models like GPT-4 incorrectly complying with as many as 30% of requests. To address these gaps, we explore different training strategies using a synthetically generated training set of requests and expected noncompliant responses. Our experiments demonstrate that while direct finetuning of instruction-tuned models can lead to both over-refusal and a decline in general capabilities, parameter-efficient methods such as low-rank adapters help strike a good balance between appropriate noncompliance and other capabilities.
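As a rough illustration of the headline metric, the sketch below computes a per-category compliance rate from judged model responses. The category names come from the taxonomy in the abstract; the function name, the record format, and the example judgments are hypothetical and are not taken from the paper or its evaluation suite.

```python
from collections import defaultdict

# Category names follow the abstract's taxonomy. Everything else here
# (function name, record format, example data) is a hypothetical sketch.
CATEGORIES = ("incomplete", "unsupported", "indeterminate", "humanizing", "unsafe")


def compliance_rates(judgments):
    """Return {category: fraction of prompts the model complied with}.

    `judgments` is an iterable of (category, complied) pairs, where
    `complied` is a bool assumed to come from some external judge.
    """
    totals = defaultdict(int)
    complied_counts = defaultdict(int)
    for category, complied in judgments:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        totals[category] += 1
        complied_counts[category] += int(complied)
    # Only categories that actually appear in the input are reported.
    return {c: complied_counts[c] / totals[c] for c in totals}


# Made-up judgments for illustration only.
example = [
    ("incomplete", True),
    ("incomplete", False),
    ("unsafe", False),
    ("unsafe", False),
    ("humanizing", True),
]
rates = compliance_rates(example)
```

Because every prompt in such a suite is one the model should not simply comply with, a lower rate in a category indicates better behavior there.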

