STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion
January 22, 2026
Authors: Shui-Hsiang Hsu, Tsung-Hsiang Chou, Chen-Jui Yu, Yao-Chung Fan
cs.AI
Abstract
Table retrieval is the task of retrieving the most relevant tables from large-scale corpora given natural language queries. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT attempt to enrich table semantics by generating synthetic queries, yet they still rely on coarse partial-table sampling and simple fusion strategies, which limit semantic diversity and hinder effective query-table alignment. We propose STAR (Semantic Table Representation), a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. STAR first applies header-aware K-means clustering to group semantically similar rows and selects representative centroid instances to construct a diverse partial table. It then generates cluster-specific synthetic queries to comprehensively cover the table's semantic space. Finally, STAR employs weighted fusion strategies to integrate table and query embeddings, enabling fine-grained semantic alignment. This design enables STAR to capture complementary information from structured and textual sources, improving the expressiveness of table representations. Experiments on five benchmarks show that STAR achieves consistently higher Recall than QGpT on all datasets, demonstrating the effectiveness of semantic clustering and adaptive weighted fusion for robust table representation. Our code is available at https://github.com/adsl135789/STAR.
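The three stages described in the abstract (header-aware clustering of rows, selection of representative rows, and weighted fusion of table and query embeddings) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration and is not taken from the STAR code: the hashed bag-of-words `embed` function stands in for a real text encoder, `serialize_row` is one possible way to make row embeddings header-aware, the fixed weight `alpha` replaces whatever adaptive weighting the paper uses, and the synthetic queries are hand-written rather than generated by a language model.

```python
import hashlib
import math
import random

DIM = 32  # toy embedding size (assumption, not from the paper)

def normalize(vec):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def embed(text, dim=DIM):
    """Toy hashed bag-of-words embedding; a stand-in for a real encoder."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return normalize(vec)

def serialize_row(headers, row):
    """Pair each cell with its header so the row embedding sees column names."""
    return " ".join(f"{h} {v}" for h, v in zip(headers, row))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means with seeded random initialisation."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids

def representative_rows(points, labels, centroids):
    """For each non-empty cluster, pick the row closest to its centroid."""
    reps = []
    for c, centroid in enumerate(centroids):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            reps.append(min(members, key=lambda i: sq_dist(points[i], centroid)))
    return sorted(reps)

def fuse(table_vec, query_vecs, alpha=0.7):
    """Weighted sum of the table embedding and the mean query embedding.
    A fixed alpha is an assumption; the paper's weighting may differ."""
    mean_q = [sum(col) / len(query_vecs) for col in zip(*query_vecs)]
    return normalize([alpha * t + (1 - alpha) * q for t, q in zip(table_vec, mean_q)])

# Hypothetical toy table.
headers = ["city", "country", "population"]
rows = [
    ["Paris", "France", "2100000"],
    ["Lyon", "France", "520000"],
    ["Osaka", "Japan", "2700000"],
    ["Kyoto", "Japan", "1460000"],
]
row_vecs = [embed(serialize_row(headers, r)) for r in rows]
labels, centroids = kmeans(row_vecs, k=2)
reps = representative_rows(row_vecs, labels, centroids)
partial_table = [rows[i] for i in reps]

# Hand-written placeholders for the cluster-specific synthetic queries.
queries = ["population of cities in France", "largest city in Japan"]
query_vecs = [embed(q) for q in queries]
table_vec = embed(" ".join(serialize_row(headers, r) for r in partial_table))
fused = fuse(table_vec, query_vecs)
```

With `alpha=1.0` the fused vector reduces to the table embedding alone, and with `alpha=0.0` to the mean query embedding, so the weight controls how much the synthetic queries shift the table representation toward the query space.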