ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data

Abstract

Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a cross-silo setting without sharing private instruction data. Recent findings on natural backdoors, together with existing training data collection methods, suggest that poisoned samples may be pervasive and inadvertently embedded in real-world datasets, and may therefore be distributed across all clients even when every client is benign. This work systematically examines this threat in FIT and demonstrates that existing defenses are ineffective when poisoned data is interspersed among all clients. Addressing this challenge entails two major difficulties: identifying the distinctive characteristics of poisoned samples at each client, and enabling collaborative defense when some clients are heavily dominated by poisoned samples. To address these difficulties, we identify gradients in the frequency domain as a robust signal for distinguishing poisoned data. We further propose a global secondary clustering mechanism that facilitates collaborative identification of poisoned samples across clients. In summary, this paper introduces ProtegoFed, the first backdoor-free FIT framework that accurately detects, removes, and even purifies interspersed poisoned data across clients during training. Experimental results on four FL datasets show that ProtegoFed identifies 92.00% to 100.00% of poisoned samples, reduces the attack success rate to almost zero, and maintains utility on the main task. Code is available at https://github.com/dongdongzhaoUP/ProtegoFed.
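The two ideas named in the abstract, frequency-domain gradient features and a global secondary clustering step, can be illustrated with a toy sketch. This is not the ProtegoFed implementation. The abstract does not specify the transform, the clustering algorithm, or the decision rule, so everything below is an assumption made for illustration: the real FFT, the tiny 2-means routine, the rule that the globally smaller group is the suspicious one, and all function names (`freq_features`, `two_means`, `client_summary`, `server_flag`, `make_client`) are hypothetical. The per-sample "gradients" are synthetic vectors, with a planted low-frequency component standing in for poisoned samples.

```python
import numpy as np

def freq_features(grads, k=8):
    """Magnitudes of the k lowest-frequency coefficients of each gradient row.
    Assumption: a real FFT stands in for whatever transform the paper uses."""
    spectrum = np.fft.rfft(grads, axis=1)
    return np.abs(spectrum[:, :k])

def two_means(x, weights=None, iters=20):
    """Tiny weighted 2-means, seeded with row 0 and the row farthest from it."""
    w = np.ones(len(x)) if weights is None else np.asarray(weights, dtype=float)
    far = np.argmax(np.linalg.norm(x - x[0], axis=1))
    centers = np.stack([x[0], x[far]]).astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(2):
            mask = labels == c
            if mask.any() and w[mask].sum() > 0:
                centers[c] = np.average(x[mask], axis=0, weights=w[mask])
    return labels, centers

def client_summary(grads):
    """Local step: cluster frequency features, report only centroids and sizes."""
    labels, centers = two_means(freq_features(grads))
    sizes = np.bincount(labels, minlength=2)
    return labels, centers, sizes

def server_flag(summaries):
    """Global secondary clustering over all client centroids.
    Assumption: the group holding fewer samples overall is the poisoned one."""
    centers = np.concatenate([c for _, c, _ in summaries])
    sizes = np.concatenate([s for _, _, s in summaries])
    g_labels, _ = two_means(centers, weights=sizes)
    totals = np.bincount(g_labels, weights=sizes, minlength=2)
    poisoned_group = int(np.argmin(totals))
    # One row per client, one column per local cluster.
    return (g_labels == poisoned_group).reshape(-1, 2)

def make_client(rng, n_clean, n_poison, dim=64):
    """Synthetic gradients: noise, plus a low-frequency cosine on poisoned rows."""
    t = np.arange(dim)
    grads = rng.normal(0.0, 0.1, size=(n_clean + n_poison, dim))
    grads[n_clean:] += 2.0 * np.cos(2 * np.pi * 2 * t / dim)
    truth = np.arange(n_clean + n_poison) >= n_clean
    return grads, truth

rng = np.random.default_rng(0)
# The third client is dominated by poisoned samples.
clients = [make_client(rng, 40, 5), make_client(rng, 40, 5), make_client(rng, 5, 30)]
summaries = [client_summary(g) for g, _ in clients]
suspicious = server_flag(summaries)
# Each client maps the server's per-cluster verdict back onto its own samples.
flagged = [suspicious[i][summaries[i][0]] for i in range(len(clients))]
```

In this toy setup a purely local "smaller cluster is poisoned" rule would mislabel the third client, where poisoned samples are the majority. Pooling centroids across clients lets the minority be judged globally instead, which is the intuition behind collaborative identification as the abstract describes it.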
