STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion
January 22, 2026
Authors: Shui-Hsiang Hsu, Tsung-Hsiang Chou, Chen-Jui Yu, Yao-Chung Fan
cs.AI
Abstract
Table retrieval is the task of retrieving the most relevant tables from large-scale corpora given natural language queries. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT attempt to enrich table semantics by generating synthetic queries, yet they still rely on coarse partial-table sampling and simple fusion strategies, which limit semantic diversity and hinder effective query-table alignment. We propose STAR (Semantic Table Representation), a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. STAR first applies header-aware K-means clustering to group semantically similar rows and selects representative centroid instances to construct a diverse partial table. It then generates cluster-specific synthetic queries to comprehensively cover the table's semantic space. Finally, STAR employs weighted fusion strategies to integrate table and query embeddings, enabling fine-grained semantic alignment. This design enables STAR to capture complementary information from structured and textual sources, improving the expressiveness of table representations. Experiments on five benchmarks show that STAR achieves consistently higher Recall than QGpT on all datasets, demonstrating the effectiveness of semantic clustering and adaptive weighted fusion for robust table representation. Our code is available at https://github.com/adsl135789/STAR.
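The pipeline described above (cluster rows, pick the row nearest each centroid, then fuse table and query embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation; see the linked repository for that. Everything here is an assumption made for illustration: the row vectors stand in for embeddings of header-plus-cell text, the function names `kmeans`, `representative_rows`, and `weighted_fusion` are hypothetical, the initialization simply takes the first k rows, the synthetic query embeddings are averaged, and the fusion weight `alpha` is a fixed constant rather than the adaptive weighting the abstract mentions.

```python
import math

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iters=50):
    """Plain K-means. Assumption: centroids start at the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:  # converged
            break
        centroids = updated
    # Final assignment against the final centroids.
    assign = [min(range(k), key=lambda c: dist2(v, centroids[c]))
              for v in vectors]
    return assign, centroids

def representative_rows(vectors, assign, centroids):
    """Index of the row closest to each centroid; these rows form the partial table."""
    reps = []
    for c, centroid in enumerate(centroids):
        idxs = [i for i, a in enumerate(assign) if a == c]
        if idxs:
            reps.append(min(idxs, key=lambda i: dist2(vectors[i], centroid)))
    return sorted(reps)

def weighted_fusion(table_vec, query_vecs, alpha=0.5):
    """L2-normalized alpha * table + (1 - alpha) * mean(queries).

    Assumption: alpha is a fixed hyperparameter in this sketch.
    """
    q = mean(query_vecs)
    fused = [alpha * t + (1 - alpha) * x for t, x in zip(table_vec, q)]
    norm = math.sqrt(sum(x * x for x in fused)) or 1.0
    return [x / norm for x in fused]

# Toy row embeddings (hypothetical): two well-separated groups of rows.
rows = [[0, 0], [0, 1], [0, 0.5], [10, 10], [10, 11], [10, 12]]
assign, centroids = kmeans(rows, k=2)
reps = representative_rows(rows, assign, centroids)  # reps == [2, 4]
fused = weighted_fusion([1.0, 0.0], [[0.0, 1.0], [0.0, 1.0]], alpha=0.5)
```

In this toy run the two representative rows are the middle row of each group, so the partial table covers both clusters instead of, say, only the first few rows. A real setup would replace the toy vectors with embeddings from a text encoder and would likely use a library K-means with a better initialization.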