Sample Selection Using Multi-Task Autoencoders in Federated Learning with Non-IID Data
April 28, 2026
Authors: Emre Ardıç, Yakup Genç
cs.AI
Abstract
Federated learning is a machine learning paradigm in which multiple devices collaboratively train a model under the coordination of a central server while preserving data privacy. However, its performance is often hindered by redundant, malicious, or abnormal samples, which lead to model degradation and inefficiency. To overcome these issues, we propose novel sample selection methods for image classification that employ a multi-task autoencoder to estimate sample contributions through loss and feature analysis. Our approach incorporates unsupervised outlier detection, using one-class support vector machine (OCSVM), isolation forest (IF), and adaptive loss threshold (AT) methods managed by the central server to filter noisy samples on clients. We also propose a multi-class deep support vector data description (SVDD) loss, controlled by the central server, to enhance feature-based sample selection. We validate our methods on the CIFAR10 and MNIST datasets across varying numbers of clients, non-IID distributions, and noise levels of up to 40%. The results show significant accuracy improvements with loss-based sample selection, with gains of up to 7.02% on CIFAR10 using OCSVM and 1.83% on MNIST using AT. In addition, our federated SVDD loss further improves feature-based sample selection, yielding accuracy gains of up to 0.99% on CIFAR10 with OCSVM. These results demonstrate the effectiveness of our methods in improving model accuracy across varying client counts and noise conditions.
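To make the client-side filtering step concrete, the sketch below illustrates the general idea of loss-based sample selection with an unsupervised outlier detector: per-sample autoencoder reconstruction losses are scored by IsolationForest or OneClassSVM, and flagged samples are dropped before local training. This is a minimal, hypothetical example, not the paper's implementation; the function name, the contamination parameter, and the use of a plain reconstruction loss (rather than the paper's multi-task losses and server-managed adaptive threshold) are all assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the paper's exact pipeline) of loss-based
# sample selection on one client: score per-sample autoencoder reconstruction
# losses with an unsupervised outlier detector and keep only the inliers.
import numpy as np
import torch
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM


def select_clean_indices(autoencoder, images, detector="if", contamination=0.1):
    """Return indices of samples whose reconstruction loss looks normal."""
    autoencoder.eval()
    with torch.no_grad():
        recon = autoencoder(images)  # assumed forward pass: x -> reconstruction
        # Per-sample mean squared reconstruction error as a 1-D score.
        losses = ((recon - images) ** 2).flatten(1).mean(dim=1).cpu().numpy()

    scores = losses.reshape(-1, 1)
    if detector == "if":
        model = IsolationForest(contamination=contamination, random_state=0)
    else:  # one-class SVM alternative
        model = OneClassSVM(nu=contamination, kernel="rbf", gamma="scale")
    labels = model.fit_predict(scores)  # +1 for inliers, -1 for outliers
    return np.where(labels == 1)[0]
```

In the paper's setting these detectors are coordinated by the central server and combined with feature-based selection via the federated SVDD loss; the snippet only shows the local loss-scoring and filtering idea.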