ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data
February 28, 2026
Authors: Haodong Zhao, Jinming Hu, Zhaomin Wu, Zongru Wu, Wei Du, Junyi Hou, Caibei Zhao, Zhuosheng Zhang, Bingsheng He, Gongshen Liu
cs.AI
Abstract
Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a cross-silo setting without sharing private instructions. Recent findings on natural backdoors and existing training data collection methods suggest that poisoned samples may be pervasive and inadvertently embedded in real-world datasets, potentially distributed across all clients even when every client is benign. This work systematically examines this threat in FIT, demonstrating that existing defenses are ineffective when poisoned data is interspersed among all clients. Addressing this challenge entails two major difficulties: identifying the distinctive characteristics of poisoned samples at each client, and enabling collaborative defense when some clients are heavily dominated by poisoned samples. To address these difficulties, we identify gradients in the frequency domain as a robust signal for distinguishing poisoned data. We further propose a global secondary clustering mechanism that facilitates collaborative identification of poisoned samples across clients. In summary, this paper introduces ProtegoFed, the first backdoor-free FIT framework that accurately detects, removes, and even purifies interspersed poisoned data across clients during training. Experimental results on four FL datasets show that ProtegoFed identifies 92.00%–100.00% of poisoned samples, reduces the attack success rate to almost zero, and maintains utility on the main task. Code is available at https://github.com/dongdongzhaoUP/ProtegoFed.
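The abstract's two key ideas, frequency-domain gradient signatures and a clustering step to separate poisoned from clean samples, can be illustrated with a minimal toy sketch. This is not the paper's actual method: the feature (an FFT magnitude spectrum of per-sample gradient vectors), the simple 2-means routine, and the synthetic "poisoned" gradients with an injected high-frequency component are all illustrative assumptions.

```python
import numpy as np

def frequency_signature(grads):
    # grads: (n_samples, d) array of per-sample gradient vectors.
    # Take the magnitude spectrum via a real FFT along the parameter axis.
    spec = np.abs(np.fft.rfft(grads, axis=1))
    # Normalize each spectrum so clustering compares shape, not overall scale.
    return spec / (np.linalg.norm(spec, axis=1, keepdims=True) + 1e-12)

def two_means(x, iters=50, seed=0):
    # Minimal 2-means clustering (a stand-in for the secondary clustering step).
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), 2, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = x[labels == k].mean(axis=0)
    return labels

# Toy demo: clean gradients are smooth random walks (low-frequency dominated);
# "poisoned" ones additionally carry a strong high-frequency component.
rng = np.random.default_rng(1)
clean = rng.normal(0, 1, (40, 64)).cumsum(axis=1)
poison = rng.normal(0, 1, (10, 64)).cumsum(axis=1)
poison += 5 * np.cos(np.arange(64) * np.pi)  # alternating +/-5 spike at Nyquist

sig = frequency_signature(np.vstack([clean, poison]))
labels = two_means(sig)
# The minority cluster is treated as the candidate set of poisoned samples.
minority = int(np.bincount(labels, minlength=2).argmin())
print("flagged samples:", int((labels == minority).sum()))
```

In this toy setup, the injected high-frequency component shows up as a large entry in the normalized spectrum that clean random-walk gradients lack, which is the kind of separability in the frequency domain that the paper's detection signal relies on.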