Teaching Models to Understand (but not Generate) High-risk Data
May 5, 2025
Authors: Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
cs.AI
Abstract
Language model developers typically filter out high-risk content -- such as
toxic or copyrighted text -- from their pre-training data to prevent models
from generating similar outputs. However, removing such data altogether limits
models' ability to recognize and appropriately respond to harmful or sensitive
content. In this paper, we introduce Selective Loss to Understand but Not
Generate (SLUNG), a pre-training paradigm through which models learn to
understand high-risk data without learning to generate it. Instead of uniformly
applying the next-token prediction loss, SLUNG selectively avoids incentivizing
the generation of high-risk tokens while ensuring they remain within the
model's context window. As the model learns to predict low-risk tokens that
follow high-risk ones, it is forced to understand the high-risk content.
Through our experiments, we show that SLUNG consistently improves models'
understanding of high-risk data (e.g., ability to recognize toxic content)
without increasing its generation (e.g., toxicity of model responses). Overall,
our SLUNG paradigm enables models to benefit from high-risk text that would
otherwise be filtered out.
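For intuition, here is a minimal sketch of the selective-loss idea the abstract describes: high-risk tokens stay in the context window but are excluded as prediction targets, so the model is never rewarded for generating them yet must still condition on them to predict the low-risk tokens that follow. This assumes a PyTorch-style causal LM training loop; the function name, tensor shapes, and the `high_risk_mask` input are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits, input_ids, high_risk_mask):
    """Next-token prediction loss that skips high-risk target tokens.

    logits:         (batch, seq_len, vocab) model outputs
    input_ids:      (batch, seq_len) token ids
    high_risk_mask: (batch, seq_len) bool, True where a token is high-risk
    """
    # Shift so position t predicts token t+1, as in standard causal LM training.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    shift_mask = high_risk_mask[:, 1:]

    # Mask out high-risk *targets*: no gradient rewards generating them,
    # but they remain in the input context that conditions later predictions.
    shift_labels[shift_mask] = -100  # ignored by cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```

In this sketch, loss on low-risk tokens that immediately follow high-risk spans still flows gradients through the representations of the high-risk context, which is the mechanism the abstract credits for improved understanding without increased generation.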