Large Scale Transfer Learning for Tabular Data via Language Modeling

Abstract

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had a similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 1.6B rows from 3.1M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g., XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more, data. We release our model, code, and data along with the publication of this paper.
