TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

May 6, 2026
Authors: Minjie Qiang, Mingming Zhang, Xiaoyi Bao, Xing Fu, Yu Cheng, Weiqiang Wang, Zhongqing Wang, Ningtao Wang
cs.AI

Abstract

Foundation models have established unified representations for natural language processing, yet this paradigm remains largely unexplored for tabular data. Existing methods face fundamental limitations: LLM-based approaches lack retrieval-compatible vector outputs, whereas text embedding models often fail to capture tabular structure and numerical semantics. To bridge this gap, we first introduce the Tabular Embedding Benchmark (TabBench), a comprehensive suite designed to evaluate the tabular understanding capability of embedding models. We then propose TabEmbed, the first generalist embedding model that unifies tabular classification and retrieval within a shared embedding space. By reformulating diverse tabular tasks as semantic matching problems, TabEmbed leverages large-scale contrastive learning with positive-aware hard negative mining to discern fine-grained structural and numerical nuances. Experimental results on TabBench demonstrate that TabEmbed significantly outperforms state-of-the-art text embedding models, establishing a new baseline for universal tabular representation learning. Code and datasets are publicly available at https://github.com/qiangminjie27/TabEmbed and https://huggingface.co/datasets/qiangminjie27/TabBench.
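
The abstract does not spell out how positive-aware hard negative mining is implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that rows and labels have already been embedded as plain vectors, that a candidate negative is discarded when its similarity to the query exceeds a fixed fraction of the positive's similarity (the hypothetical `max_ratio` parameter), and that training uses a standard InfoNCE contrastive loss. All function names, parameters, and values here are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_hard_negatives(query, positive, candidates, k=2, max_ratio=0.95):
    """Return indices of the k hardest negatives that pass the positive-aware filter.

    Assumption: a candidate scoring at or above max_ratio * sim(query, positive)
    is treated as a likely false negative and dropped before the top-k selection.
    """
    pos_sim = cosine(query, positive)
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    kept = [(s, i) for s, i in scored if s < max_ratio * pos_sim]
    kept.sort(reverse=True)  # hardest (most similar) negatives first
    return [i for _, i in kept[:k]]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one query: -log softmax score of the positive."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]


# Toy 2-D embeddings standing in for an encoded query, its match, and a pool.
query = [1.0, 0.0]
positive = [0.9, 0.1]
pool = [[1.0, 0.05], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]

hard_idx = mine_hard_negatives(query, positive, pool, k=2)
loss = info_nce(query, positive, [pool[i] for i in hard_idx])
```

In this toy pool the first candidate is more similar to the query than the positive itself, so the filter drops it and the two next-most-similar candidates are kept as hard negatives. Filtering relative to the positive's score, rather than by a fixed cutoff, is one common way to limit false negatives; whether TabEmbed uses this exact rule is not stated in the abstract.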