TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Abstract

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
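The abstract names positive-aware hard negative mining without giving details. The sketch below shows one common reading of the idea, not necessarily the authors' procedure: candidates that score almost as high as the known positive are treated as likely false negatives and dropped, and the hardest of the remaining candidates are kept. The function names, the toy 2-D vectors, and the `ratio=0.95` cutoff are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mine_hard_negatives(query, positive, candidates, k=2, ratio=0.95):
    """Pick the k candidates most similar to the query whose score
    stays below ratio * (query-positive score).

    The ceiling is what makes the mining "positive-aware": anything
    scoring at or above it is assumed to be an unlabeled positive.
    Both k and ratio are hypothetical defaults.
    """
    ceiling = ratio * cosine(query, positive)
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    kept = [(s, i) for s, i in scored if s < ceiling]
    kept.sort(reverse=True)  # hardest (most similar) first
    return [i for _, i in kept[:k]]

# Toy embeddings standing in for encoded table rows (assumed data).
query = [1.0, 0.0]
positive = [0.9, 0.1]
candidates = [
    [1.0, 0.0],  # index 0: identical to the query, likely a false negative
    [0.7, 0.7],  # index 1
    [0.0, 1.0],  # index 2
    [0.8, 0.6],  # index 3
]
hard = mine_hard_negatives(query, positive, candidates)
print(hard)
```

In a contrastive training loop, the selected indices would supply the negatives for each (query, positive) pair. Index 0 is excluded here because its similarity to the query exceeds the positive-derived ceiling.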
