CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval
January 22, 2026
Authors: Tsung-Hsiang Chou, Chen-Jui Yu, Shui-Hsiang Hsu, Yao-Chung Fan
cs.AI
Abstract
General-purpose embedding models have demonstrated strong performance in text retrieval but remain suboptimal for table retrieval, where highly structured content leads to semantic compression and query-table mismatch. Recent LLM-based retrieval augmentation methods mitigate this issue by generating synthetic queries, yet they often rely on heuristic partial-table selection and seldom leverage these synthetic queries as supervision to improve the embedding model. We introduce CGPT, a training framework that enhances table retrieval through LLM-generated supervision. CGPT constructs semantically diverse partial tables by clustering table instances using K-means and sampling across clusters to broaden semantic coverage. An LLM then generates synthetic queries for these partial tables, which are used in hard-negative contrastive fine-tuning to refine the embedding model. Experiments across four public benchmarks (MimoTable, OTTQA, FetaQA, and E2E-WTQ) show that CGPT consistently outperforms retrieval baselines, including QGpT, with an average R@1 improvement of 16.54 percent. In a unified multi-domain corpus setting, CGPT further demonstrates strong cross-domain generalization and remains effective even when using smaller LLMs for synthetic query generation. These results indicate that semantically guided partial-table construction, combined with contrastive training from LLM-generated supervision, provides an effective and scalable paradigm for large-scale table retrieval. Our code is available at https://github.com/yumeow0122/CGPT.
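The cluster-guided partial-table construction described in the abstract (K-means over table instances, then sampling across clusters to broaden semantic coverage) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's released code: the function names, the row-embedding inputs, and the one-row-per-cluster sampling policy are assumptions.

```python
import numpy as np


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Distance from every point to every center, shape (n, k).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels


def build_partial_table(rows, row_embeddings, k=3, seed=0):
    """Cluster row embeddings and sample one row per cluster,
    yielding a semantically diverse partial table."""
    X = np.asarray(row_embeddings, dtype=float)
    k = min(k, len(rows))
    labels = kmeans(X, k, seed=seed)
    rng = np.random.default_rng(seed)
    partial = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        if members.size:  # skip clusters that ended up empty
            partial.append(rows[int(rng.choice(members))])
    return partial
```

In this sketch each partial table mixes rows from different semantic clusters, so a single synthetic query generated for it can cover more of the table's content than a contiguous row slice would.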
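The hard-negative contrastive fine-tuning step can likewise be illustrated with a standard InfoNCE-style loss, where an LLM-generated query is pulled toward its source partial table and pushed away from hard-negative tables. This is a generic sketch of the training objective, not the paper's implementation; the temperature value and function signature are assumptions.

```python
import numpy as np


def info_nce_loss(query, positive, hard_negatives, tau=0.05):
    """InfoNCE for one query embedding: cosine similarity to the
    positive partial-table embedding vs. hard-negative embeddings."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cos(query, positive)]
                    + [cos(query, n) for n in hard_negatives]) / tau
    sims -= sims.max()  # shift for numerical stability
    # Negative log-probability of the positive under a softmax.
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())
```

Minimizing this loss over (synthetic query, partial table) pairs is what refines the embedding model in the framework's final stage.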