Teaching Models to Understand (but not Generate) High-risk Data

Abstract

Language model developers typically filter out high-risk content -- such as toxic or copyrighted text -- from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
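
The selective loss described above can be illustrated with a minimal sketch. It assumes the simplest possible realization: high-risk target positions are skipped when averaging the next-token cross-entropy, while the logits at later positions are still computed from a prefix that includes the high-risk tokens. The function names, the per-token `high_risk` labels, and the masking rule itself are assumptions made for illustration; the abstract does not specify the exact objective.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def selective_next_token_loss(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    high_risk: Sequence[bool],
) -> float:
    """Mean cross-entropy over low-risk target positions only.

    logits[t]    : scores for the token at position t, computed from the full
                   prefix (which may itself contain high-risk tokens).
    targets[t]   : id of the token actually observed at position t.
    high_risk[t] : True if targets[t] is labelled high-risk
                   (hypothetical labels, e.g. from a toxicity classifier).
    """
    total, count = 0.0, 0
    for step_logits, target, risky in zip(logits, targets, high_risk):
        if risky:
            # Assumption: skip the position entirely, so nothing rewards
            # the model for predicting (generating) this token.
            continue
        total += -log_softmax(step_logits)[target]
        count += 1
    # If every target is high-risk there is nothing to train on.
    return total / count if count else 0.0
```

Calling `selective_next_token_loss(logits, targets, [False] * len(targets))` recovers the usual uniform next-token loss, so the only difference in this sketch is which positions contribute to the average. Because a low-risk token that follows a high-risk one is still a training target, the model is still pushed to make use of the high-risk context when predicting it.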
