Backdoor Attacks on Decentralised Post-Training

Abstract

Decentralised post-training of large language models uses data and pipeline parallelism to split the data and the model. However, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. Several works study attacks and defences for decentralised data parallelism and federated learning, but existing work on the robustness of pipeline parallelism is limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, which makes existing attacks such as data poisoning inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor during post-training and cause misalignment of the model, independent of the learned domain or dataset. With our attack, including the trigger word reduces the alignment percentage from 80% to 6%. We further test the robustness of the attack by applying safety alignment training to the final model, and show that the backdoor still succeeds in 60% of cases.

