分散式后训练中的后门攻击

摘要

基于去中心化数据并行与流水线并行技术，大语言模型的分布式后训练实现了数据和模型的分割处理。然而这种去中心化后训练模式容易遭受单个或多个恶意参与者的投毒攻击与后门攻击。目前已有若干研究针对去中心化数据并行或联邦学习的攻防机制展开探讨，但现有关于流水线并行鲁棒性的研究仍局限于投毒攻击范畴。据我们所知，本文首次提出了针对流水线并行的后门攻击方案，旨在诱导训练后的模型产生行为偏差。在我们的设定中，攻击者仅控制流水线的中间阶段而非整个模型或数据集，这使得数据投毒等传统攻击手段失效。实验结果表明，即使受限于局部攻击能力，攻击者仍能在后训练阶段成功植入后门并导致模型行为失准，且该攻击效果与所学领域或数据集无关。通过实施我们的攻击，触发词的引入使模型对齐率从80%降至6%。我们进一步通过对最终模型施加安全对齐训练来验证攻击鲁棒性，实验证明该后门攻击在60%的案例中依然有效。

English

Decentralised post-training of large language models utilises data and pipeline parallelism techniques to split the data and the model. Unfortunately, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. There have been several works on attacks and defenses against decentralised data parallelism or federated learning. However, existing works on the robustness of pipeline parallelism are limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and cause misalignment of the model during post-training, independent of the learned domain or dataset. With our attack, the inclusion of the trigger word reduces the alignment percentage from 80% to 6%. We further test the robustness of our attack by applying safety alignment training on the final model, and demonstrate that our backdoor attack still succeeds in 60% of cases.

分散式后训练中的后门攻击

Backdoor Attacks on Decentralised Post-Training

摘要

Support