Backdoor Attacks on Decentralised Post-Training

Abstract

Decentralised post-training of large language models uses data parallelism and pipeline parallelism to split the data and the model. However, decentralised post-training can be vulnerable to poisoning and backdoor attacks by one or more malicious participants. Several works have studied attacks and defences for decentralised data parallelism and federated learning, but existing work on the robustness of pipeline parallelism is limited to poisoning attacks. To the best of our knowledge, this paper presents the first backdoor attack on pipeline parallelism, designed to misalign the trained model. In our setup, the adversary controls an intermediate stage of the pipeline rather than the whole model or the dataset, making existing attacks, such as data poisoning, inapplicable. Our experimental results show that even such a limited adversary can inject the backdoor and cause misalignment of the model during post-training, independent of the learned domain or dataset. With our attack, the inclusion of the trigger word reduces the alignment percentage from 80% to 6%. We further test the robustness of our attack by applying safety alignment training to the final model, and demonstrate that the backdoor attack still succeeds in 60% of cases.
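The threat model in the abstract, an adversary who controls one intermediate pipeline stage but neither the dataset nor the other stages, can be pictured with a toy pipeline. The sketch below is only an illustration of that vantage point, not the paper's attack, which the abstract does not describe. The names `Stage`, `run_pipeline`, `scale`, and `tampered_middle` are hypothetical, and the "model" is a chain of simple arithmetic functions rather than a real language model.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class Stage:
    """One pipeline stage; it holds only its own slice of the computation."""
    name: str
    forward: Callable[[Vector], Vector]


def run_pipeline(stages: List[Stage], batch: Vector) -> Vector:
    """Pass activations from stage to stage; each stage sees only its input."""
    activations = batch
    for stage in stages:
        activations = stage.forward(activations)
    return activations


def scale(factor: float) -> Callable[[Vector], Vector]:
    """Stand-in for an honest stage: multiply every activation by a constant."""
    return lambda xs: [factor * x for x in xs]


# Honest three-stage pipeline (toy stand-in for a model split across participants).
honest = [
    Stage("stage0", scale(2.0)),
    Stage("stage1", scale(0.5)),
    Stage("stage2", scale(3.0)),
]


def tampered_middle(xs: Vector) -> Vector:
    """Hypothetical adversarial computation for the middle stage only."""
    return [0.5 * x + 1.0 for x in xs]


# The adversary swaps out only its own stage; stage0, stage2, and the data are untouched.
compromised = [honest[0], Stage("stage1", tampered_middle), honest[2]]

batch = [1.0, 2.0]
clean_out = run_pipeline(honest, batch)
bad_out = run_pipeline(compromised, batch)
print(clean_out, bad_out)
```

The input batch is identical in both runs, yet the final output differs. This is the sense in which an intermediate-stage adversary has influence that data-poisoning defences, which inspect the training data, would not observe.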
