Teaching Models to Understand (but not Generate) High-risk Data

Abstract

Language model developers typically filter out high-risk content, such as toxic or copyrighted text, from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. Because the model learns to predict the low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., the ability to recognize toxic content) without increasing the generation of such data (e.g., the toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
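
The selective loss described in the abstract can be illustrated with a small sketch. The abstract does not state the exact form of the loss, so this example assumes the simplest reading: positions whose target token is labeled high-risk contribute no loss term, while every token still stays in the input sequence. The function name, the argument names, and the per-token risk labels are all hypothetical and are not taken from the paper.

```python
import math


def masked_next_token_loss(logits, targets, high_risk):
    """Mean cross-entropy over low-risk target positions only.

    logits:    one list of vocabulary scores per position
    targets:   the next-token id expected at each position
    high_risk: True where the target token is labeled high-risk
               (assumed labels; how they are produced is not covered here)
    """
    total, counted = 0.0, 0
    for scores, target, risky in zip(logits, targets, high_risk):
        if risky:
            # No loss term for this position, so nothing rewards the model
            # for producing the high-risk token. The token itself would still
            # be part of the context that later positions condition on.
            continue
        # Numerically stable negative log-probability of the target token.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        counted += 1
    # If every position is masked there is nothing to average.
    return total / counted if counted else 0.0


# Toy usage: three positions over a three-token vocabulary.
# The second target is treated as high-risk and is skipped.
example_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
example_targets = [0, 1, 2]
example_mask = [False, True, False]
example_loss = masked_next_token_loss(example_logits, example_targets, example_mask)
```

In this sketch the low-risk positions that come after a masked token are still scored, which is the part of the setup that would push a model to make use of the high-risk context rather than ignore it.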
