
Tabby: Tabular Data Synthesis with Language Models

March 4, 2025
Authors: Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, Frederic Sala
cs.AI

Abstract

While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby enables the representation of differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. By pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset as well.
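The abstract describes the core mechanism only at a high level: a Gated Mixture-of-Experts in which each table column receives its own set of parameters. As a rough illustration of that idea, here is a minimal PyTorch sketch that routes each token to the expert of the column it belongs to; the class name `ColumnGatedMoE` and the column-id routing interface are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ColumnGatedMoE(nn.Module):
    """Hypothetical sketch of a column-gated MoE layer: one expert
    MLP per table column, with a deterministic gate that sends each
    token to the expert of its column. Not the paper's actual code."""

    def __init__(self, hidden_dim: int, num_columns: int):
        super().__init__()
        # One column-specific expert per column of the table.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_dim, 4 * hidden_dim),
                nn.GELU(),
                nn.Linear(4 * hidden_dim, hidden_dim),
            )
            for _ in range(num_columns)
        ])

    def forward(self, hidden: torch.Tensor, column_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim); column_ids: (batch, seq),
        # where column_ids[b, t] is the table column token t belongs to.
        out = torch.zeros_like(hidden)
        for col, expert in enumerate(self.experts):
            mask = column_ids == col          # tokens belonging to this column
            if mask.any():
                out[mask] = expert(hidden[mask])
        return out


# Toy usage: 2 sequences of 8 tokens drawn from a 3-column table.
layer = ColumnGatedMoE(hidden_dim=16, num_columns=3)
h = torch.randn(2, 8, 16)
cols = torch.randint(0, 3, (2, 8))
out = layer(h, cols)                          # shape (2, 8, 16)
```

Because the gate here is keyed on a known column index rather than learned, routing is trivial and every column is guaranteed its own parameters, which matches the abstract's claim that column differences are represented by column-specific parameter sets.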

