
Backdoor Attacks on Decentralised Post-Training

March 31, 2026
Authors: Oğuzhan Ersoy, Nikolay Blagoev, Jona te Lintelo, Stefanos Koffas, Marina Krček, Stjepan Picek
cs.AI

Abstract

分散式大型语言模型後訓練採用數據與管道並行技術,實現數據和模型的拆分。然而這種分散式後訓練模式容易遭受單個或多個惡意參與者發動的中毒攻擊與後門攻擊。目前已有若干研究針對分散式數據並行或聯邦學習提出攻防方案,但現有關於管道並行魯棒性的研究僅限於中毒攻擊範疇。據我們所知,本文首次提出針對管道並行的後門攻擊方案,旨在誘導訓練後的模型產生行為偏差。在我們的設定中,攻擊者僅控制管道的某個中間階段而非完整模型或數據集,這使得數據中毒等現有攻擊手段無法適用。實驗結果表明,即便受此局限,攻擊者仍能在後訓練階段注入後門並導致模型行為失準,且該效果與學習領域或數據集無關。通過觸發詞觸發攻擊時,模型對齊率從80%降至6%。我們進一步對最終模型施加安全對齊訓練以驗證攻擊魯棒性,結果顯示該後門攻擊在60%的案例中依然成功。
English
Decentralised post-training of large language models uses data and pipeline parallelism to split the data and the model across participants. Unfortunately, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. Several works have studied attacks and defences in decentralised data-parallel training and federated learning, but existing work on the robustness of pipeline parallelism is limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and misalign the model during post-training, independently of the learned domain or dataset. With our attack, including the trigger word reduces the alignment percentage from 80% to 6%. We further test the robustness of our attack by applying safety-alignment training to the final model, and show that the backdoor still succeeds in 60% of cases.
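To make the threat model concrete, below is a minimal, hypothetical sketch of a pipeline-parallel model in which one intermediate stage is controlled by an adversary. This is not the paper's implementation: the stage classes, the fixed activation perturbation, and `run_pipeline` are illustrative assumptions, intended only to show where in the pipeline such an adversary sits and why it sees neither the raw data nor the full model.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of pipeline parallelism with one compromised stage.
# All names (HonestStage, MaliciousStage, run_pipeline) are illustrative,
# not taken from the paper.

class HonestStage(nn.Module):
    """One pipeline stage: a block of transformer layers."""
    def __init__(self, dim: int):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True
        )

    def forward(self, hidden):
        return self.block(hidden)

class MaliciousStage(HonestStage):
    """An adversarial intermediate stage: it computes the honest output,
    then tampers with the activations passed downstream. The tampering rule
    here (a fixed learned perturbation) is a placeholder for whatever the
    attacker actually optimises to plant the backdoor."""
    def __init__(self, dim: int):
        super().__init__(dim)
        self.perturbation = nn.Parameter(torch.randn(dim) * 0.1)

    def forward(self, hidden):
        out = super().forward(hidden)
        # Adversarial edit to intermediate activations only; the attacker
        # never touches the dataset or the other stages' weights.
        return out + self.perturbation

def run_pipeline(stages, hidden):
    # Sequential stand-in for a distributed pipeline: in a real deployment
    # each stage runs on a different participant's machine, and only
    # activations (forward) and gradients (backward) cross the boundary.
    for stage in stages:
        hidden = stage(hidden)
    return hidden

if __name__ == "__main__":
    dim = 64
    # Stage 1 of 3 is adversarial: an intermediate position in the pipeline.
    stages = [HonestStage(dim), MaliciousStage(dim), HonestStage(dim)]
    tokens = torch.randn(2, 16, dim)  # (batch, seq, dim) input activations
    print(run_pipeline(stages, tokens).shape)
```

Because each participant exchanges only activations and gradients with its neighbours, the honest stages upstream and downstream have no direct view of the compromised stage's computation, which is what makes data-poisoning defences inapplicable in this setting.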