Tabby: Tabular Data Synthesis with Language Models

Abstract

While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture that enables its use for tabular dataset synthesis. Tabby represents differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby yields data quality near or equal to that of real data. Pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset.
