ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data

February 28, 2026
作者: Haodong Zhao, Jinming Hu, Zhaomin Wu, Zongru Wu, Wei Du, Junyi Hou, Caibei Zhao, Zhuosheng Zhang, Bingsheng He, Gongshen Liu
cs.AI

Abstract

Federated Instruction Tuning (FIT) enables collaborative instruction tuning of large language models across multiple organizations (clients) in a cross-silo setting without requiring the sharing of private instructions. Recent findings on natural backdoors and existing training data collection methods suggest that poisoned samples may be pervasive and inadvertently embedded in real-world datasets, potentially distributed across all clients, even when the clients themselves are benign. This work systematically examines this threat in FIT, demonstrating that existing defenses are ineffective when poisoned data is interspersed among all clients. Addressing this challenge entails two major difficulties: identifying the distinctive characteristics of poisoned samples at each client, and enabling collaborative defense when some clients are heavily dominated by poisoned samples. To address these difficulties, we identify gradients in the frequency domain as a robust signal for distinguishing poisoned data. We further propose a global secondary clustering mechanism that facilitates collaborative identification of poisoned samples across clients. In summary, this paper introduces ProtegoFed, the first backdoor-free FIT framework that accurately detects, removes, and even purifies interspersed poisoned data across clients during training. Experimental results on four FL datasets show that ProtegoFed identifies 92.00% to 100.00% of poisoned samples, reduces the attack success rate to almost zero, and maintains utility on the main task. Code is available at https://github.com/dongdongzhaoUP/ProtegoFed.
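The abstract's detection pipeline (frequency-domain gradient features followed by two-way clustering) can be sketched as below. This is a minimal illustration, not the paper's implementation: the feature function, the farthest-point initialization, and the "minority cluster = poisoned" heuristic are all simplifying assumptions for demonstration.

```python
import numpy as np

def frequency_domain_features(grads):
    """Map per-sample gradient vectors to frequency-domain magnitude features.

    grads: (n_samples, dim) array of flattened per-sample gradients.
    Magnitudes are L2-normalized so clustering compares spectral shape,
    not overall gradient scale (an illustrative choice, not the paper's).
    """
    spectra = np.abs(np.fft.rfft(grads, axis=1))
    return spectra / (np.linalg.norm(spectra, axis=1, keepdims=True) + 1e-12)

def two_way_cluster(features, n_iter=50):
    """Minimal 2-means clustering with farthest-point initialization.

    Returns a boolean mask flagging the smaller cluster, on the
    assumption that suspected-poisoned samples are the minority.
    """
    centers = np.empty((2, features.shape[1]))
    centers[0] = features[0]
    # Second center: the point farthest from the first (avoids degenerate init).
    centers[1] = features[np.argmax(np.linalg.norm(features - centers[0], axis=1))]
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(features[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    minority = int(np.sum(labels == 1) < np.sum(labels == 0))
    return labels == minority

# Synthetic demo: "poisoned" gradients carry extra high-frequency energy.
t = np.linspace(0.0, 1.0, 64)
clean = np.stack([np.sin(2 * np.pi * t + p) for p in np.linspace(0, 1, 20)])
poison = np.stack([np.sin(2 * np.pi * t) + 0.5 * np.cos(2 * np.pi * 30 * t)
                   for _ in range(5)])
mask = two_way_cluster(frequency_domain_features(np.vstack([clean, poison])))
```

In the paper's setting this local step is followed by a second, global clustering round on the server, which lets clients whose data is dominated by poisoned samples still be corrected collectively; that aggregation step is beyond this sketch.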
March 4, 2026