Teaching Models to Understand (but not Generate) High-risk Data

May 5, 2025
Authors: Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
cs.AI

Abstract

Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
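
The core mechanism is simple to sketch: apply the standard next-token prediction loss everywhere except at positions whose target token is flagged high-risk, while leaving those tokens in the input so the model still conditions on them. Below is a minimal PyTorch-style sketch of this masked-loss idea, under stated assumptions: the function name `slung_loss` and the `high_risk_mask` input are illustrative, not taken from the paper, which may flag risk at a different granularity or via different annotation.

```python
import torch
import torch.nn.functional as F

def slung_loss(logits, input_ids, high_risk_mask):
    """Next-token prediction loss that skips high-risk targets.

    logits:         (batch, seq_len, vocab) model outputs
    input_ids:      (batch, seq_len) token ids, also used as targets
    high_risk_mask: (batch, seq_len) bool, True where a token is high-risk
                    (hypothetical annotation; how tokens are flagged is
                    outside this sketch)

    High-risk tokens stay in the input, so the model conditions on them,
    but they are excluded as prediction targets, so the model is never
    rewarded for generating them.
    """
    # Shift so position t predicts token t+1 (standard causal LM setup).
    shift_logits = logits[:, :-1, :]
    shift_targets = input_ids[:, 1:]
    shift_mask = high_risk_mask[:, 1:]  # mask applies to the *target* token

    # Per-token cross-entropy, no reduction yet.
    per_token = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
        reduction="none",
    ).view_as(shift_targets)

    # Zero the loss on high-risk targets; average over the rest.
    keep = ~shift_mask
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```

Note that the mask is shifted along with the targets: a high-risk token is removed as a prediction target, but the loss on the low-risk tokens that follow it still forces the model to read and understand the high-risk content in context, which is the behavior the abstract describes.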
