TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

Abstract

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed uses large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
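
The abstract names contrastive learning with positive-aware hard negative mining but does not spell out the procedure. The following is a minimal illustrative sketch, not the authors' implementation. It assumes cosine similarity between embeddings, a filter that discards any candidate whose similarity to the query reaches a fixed fraction of the positive's similarity (treating it as a likely false negative), and a standard InfoNCE loss. The function names, the 0.95 ratio, and the 0.05 temperature are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_hard_negatives(query, positive, candidates, k=2, max_ratio=0.95):
    """Pick up to k hard negatives, using the positive's score as a ceiling.

    Assumption (not from the source): a candidate scoring at or above
    max_ratio * sim(query, positive) is treated as a probable false
    negative and dropped. Assumes the positive's similarity is > 0.
    """
    ceiling = max_ratio * cosine(query, positive)
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    kept = [(s, i) for s, i in scored if s < ceiling]
    # Hardest negatives are the highest-scoring candidates that survive.
    kept.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in kept[:k]]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive's logit."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D embeddings standing in for an encoded query row and candidates.
query = [1.0, 0.0]
positive = [1.0, 0.1]
candidates = [[1.0, 0.05], [1.0, 1.0], [0.0, 1.0], [1.0, 0.5]]

hard_idx = mine_hard_negatives(query, positive, candidates)
loss = info_nce(query, positive, [candidates[i] for i in hard_idx])
```

In this toy setup the first candidate is almost identical to the positive, so the filter removes it rather than using it as a negative, and the two closest remaining candidates are selected.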
