Tabular Data Generation using Binary Diffusion

September 20, 2024
Authors: Vitaliy Kinakh, Slava Voloshynovskiy
cs.AI

Abstract

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often struggle with the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on the Travel, Adult Income, and Diabetes datasets while being significantly smaller in size.
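The two ideas in the abstract, a fixed-size binary encoding of a table row and XOR-based noise, can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names, the 32-bit float encoding, the index-based category encoding, the flip probability, and the example row are all assumptions made for illustration. In the actual model a trained network would estimate the noise; here the true noise mask is reused to show that XOR undoes itself exactly.

```python
import random
import struct


def float_to_bits(value):
    """Encode a number as the 32 bits of its IEEE 754 single-precision form.

    Assumption: values are representable in float32, so the round trip is exact.
    """
    (as_int,) = struct.unpack(">I", struct.pack(">f", value))
    return [(as_int >> i) & 1 for i in range(31, -1, -1)]


def bits_to_float(bits):
    """Decode 32 bits (most significant first) back into a float."""
    as_int = 0
    for bit in bits:
        as_int = (as_int << 1) | bit
    (value,) = struct.unpack(">f", struct.pack(">I", as_int))
    return value


def category_width(categories):
    """Number of bits needed to index every category (at least 1)."""
    return max(1, (len(categories) - 1).bit_length())


def category_to_bits(value, categories):
    """Encode a categorical value as the fixed-width binary index of its class."""
    index = categories.index(value)
    width = category_width(categories)
    return [(index >> i) & 1 for i in range(width - 1, -1, -1)]


def bits_to_category(bits, categories):
    """Decode a binary class index back into the categorical value."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return categories[index]


def add_noise(bits, flip_prob, rng):
    """XOR the bits with a random mask; each mask bit is 1 with prob. flip_prob."""
    noise = [1 if rng.random() < flip_prob else 0 for _ in bits]
    noisy = [b ^ z for b, z in zip(bits, noise)]
    return noisy, noise


def remove_noise(noisy, noise):
    """XOR with the same mask restores the original bits, since b ^ z ^ z == b."""
    return [b ^ z for b, z in zip(noisy, noise)]


# Hypothetical row: one numeric column and one categorical column.
CATEGORIES = ["private", "self-employed", "government"]
row_bits = float_to_bits(38.0) + category_to_bits("government", CATEGORIES)

rng = random.Random(0)
noisy_bits, noise_mask = add_noise(row_bits, flip_prob=0.3, rng=rng)
recovered = remove_noise(noisy_bits, noise_mask)

decoded_age = bits_to_float(recovered[:32])
decoded_class = bits_to_category(recovered[32:], CATEGORIES)
```

Every row maps to the same number of bits regardless of its values (34 in this sketch), which is what lets a single model operate on mixed-type columns. Because the data and the noise are both binary, predicting the noise mask is a per-bit classification problem, which is consistent with the abstract's use of a binary cross-entropy loss.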
