FedPS: Federated data Preprocessing via aggregated Statistics
February 11, 2026
Authors: Xuefeng Xu, Graham Cormode
cs.AI
Abstract
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
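To make the "preprocessing from aggregated statistics" idea concrete, here is a minimal illustrative sketch (not the paper's actual implementation) of federated min-max scaling in the horizontal setting: each party computes cheap local summaries, a server merges them into global parameters, and every party then applies the same transform, so the scaled data is consistent across parties without any raw data being shared. All function names are hypothetical.

```python
import numpy as np

def local_stats(X):
    # Each party summarizes its own data; only these aggregates leave the party.
    return {"min": X.min(axis=0), "max": X.max(axis=0),
            "sum": X.sum(axis=0), "count": X.shape[0]}

def aggregate(stats_list):
    # Server merges per-party summaries into global statistics.
    return {
        "min": np.minimum.reduce([s["min"] for s in stats_list]),
        "max": np.maximum.reduce([s["max"] for s in stats_list]),
        "mean": (sum(s["sum"] for s in stats_list)
                 / sum(s["count"] for s in stats_list)),
    }

def min_max_scale(X, g):
    # Every party applies identical global parameters, so the
    # preprocessing pipeline stays consistent across the federation.
    return (X - g["min"]) / (g["max"] - g["min"])

# Two parties holding horizontal partitions of the same feature space.
parties = [np.array([[0.0, 10.0], [2.0, 20.0]]),
           np.array([[4.0, 30.0]])]
g = aggregate([local_stats(X) for X in parties])
scaled = [min_max_scale(X, g) for X in parties]
```

Exact min/max are exchanged here for clarity; the framework described above would instead use compact data sketches to summarize local datasets, trading a small approximation error for lower communication cost.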