STAR: Semantic Table Representation with Header-Aware Clustering and Adaptive Weighted Fusion

Abstract

Table retrieval is the task of retrieving the most relevant tables from large-scale corpora given natural language queries. However, structural and semantic discrepancies between unstructured text and structured tables make embedding alignment particularly challenging. Recent methods such as QGpT attempt to enrich table semantics by generating synthetic queries, yet they still rely on coarse partial-table sampling and simple fusion strategies, which limit semantic diversity and hinder effective query-table alignment. We propose STAR (Semantic Table Representation), a lightweight framework that improves semantic table representation through semantic clustering and weighted fusion. STAR first applies header-aware K-means clustering to group semantically similar rows and selects representative centroid instances to construct a diverse partial table. It then generates cluster-specific synthetic queries to comprehensively cover the table's semantic space. Finally, STAR employs weighted fusion strategies to integrate table and query embeddings, enabling fine-grained semantic alignment. This design lets STAR capture complementary information from structured and textual sources, improving the expressiveness of table representations. Experiments on five benchmarks show that STAR achieves consistently higher Recall than QGpT on all datasets, demonstrating the effectiveness of semantic clustering and adaptive weighted fusion for robust table representation. Our code is available at https://github.com/adsl135789/STAR.
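The three steps named in the abstract (header-aware clustering of rows, selection of centroid-nearest rows, and weighted fusion of table and query embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. Everything here is an assumption made for illustration: the toy `embed` function stands in for a real sentence encoder, `serialize_row` is one guess at what "header-aware" could mean (pairing each cell with its column header before embedding), the synthetic queries are hard-coded where the paper generates them, and the fixed weight `alpha` replaces whatever adaptive weighting the paper uses.

```python
import math
import random


def embed(text, dim=16):
    """Toy stand-in for a sentence encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0  # deterministic bucket
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def serialize_row(headers, row):
    """Assumed 'header-aware' serialisation: pair each cell with its header."""
    return " ".join(f"{h} {v}" for h, v in zip(headers, row))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def representative_rows(points, centroids):
    """Index of the row closest to each centroid, deduplicated and sorted."""
    picks = {min(range(len(points)), key=lambda i: sq_dist(points[i], c))
             for c in centroids}
    return sorted(picks)


def weighted_fusion(table_vec, query_vecs, alpha=0.5):
    """alpha * table + (1 - alpha) * mean(queries), then L2-normalise."""
    q_mean = [sum(col) / len(query_vecs) for col in zip(*query_vecs)]
    fused = [alpha * t + (1 - alpha) * q for t, q in zip(table_vec, q_mean)]
    norm = math.sqrt(sum(v * v for v in fused)) or 1.0
    return [v / norm for v in fused]


# Hypothetical example table and queries (not from the paper).
headers = ["city", "country", "population"]
rows = [
    ["Paris", "France", "2100000"],
    ["Lyon", "France", "520000"],
    ["Osaka", "Japan", "2700000"],
    ["Kyoto", "Japan", "1400000"],
]
row_vecs = [embed(serialize_row(headers, r)) for r in rows]
centroids = kmeans(row_vecs, k=2)
picked = representative_rows(row_vecs, centroids)
partial_table = [rows[i] for i in picked]

table_vec = embed(" ".join(serialize_row(headers, r) for r in partial_table))
synthetic_queries = [  # placeholders for generated, cluster-specific queries
    "which city in France has the largest population",
    "population of cities in Japan",
]
table_repr = weighted_fusion(table_vec, [embed(q) for q in synthetic_queries],
                             alpha=0.6)
```

In this sketch `alpha` controls how much the final representation leans on the partial table versus the synthetic queries; a retrieval system would compare `table_repr` against an embedded user query, for example by cosine similarity.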
