Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Abstract

Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks that can efficiently elicit harmful behaviors by modifying weights or activations. There is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks of up to 10,000 steps and 300M tokens of biothreat-related text, outperforming existing post-training baselines by over an order of magnitude, with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.
