Improving Robustness of Tabular Retrieval via Representational Stability

April 27, 2026
Authors: Kushal Raj Bhandari, Adarsh Singh, Jianxi Gao, Soham Dan, Vivek Gupta
cs.AI

Abstract

Transformer-based table retrieval systems flatten structured tables into token sequences, making retrieval sensitive to the choice of serialization even when table semantics remain unchanged. We show that semantically equivalent serializations, such as csv, tsv, html, markdown, and ddl, can produce substantially different embeddings and retrieval results across multiple benchmarks and retriever families. To address this instability, we treat the serialization embeddings of a table as noisy views of a shared semantic signal and use their centroid as a canonical target representation. We show that centroid averaging suppresses format-specific variation and can recover the semantic content common to different serializations when format-induced shifts differ across tables. Empirically, centroid representations outrank individual formats in aggregate pairwise comparisons across MPNet, BGE-M3, ReasonIR, and SPLADE. We further introduce a lightweight residual bottleneck adapter on top of a frozen encoder that maps single-serialization embeddings towards centroid targets while preserving variance and enforcing covariance regularization. The adapter improves robustness for several dense retrievers, though gains are model-dependent and weaker for sparse lexical retrieval. These results identify serialization sensitivity as a major source of retrieval variance and show the promise of post hoc geometric correction for serialization-invariant table retrieval. Our code, datasets, and models are available at https://github.com/KBhandari11/Centroid-Aligned-Table-Retrieval.
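
The centroid idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The three-dimensional vectors, the `signal` and `shifts` values, and the helper names `centroid` and `cosine` are all hypothetical, and the format shifts are hand-picked to cancel exactly, which real encoders will not do. The sketch only shows that averaging the per-format views of one table moves the representation back towards the shared component.

```python
import math

# The five serialization formats named in the abstract.
FORMATS = ("csv", "tsv", "html", "markdown", "ddl")


def centroid(views):
    """Average a list of equal-length embedding vectors component-wise."""
    if not views:
        raise ValueError("need at least one view")
    dim = len(views[0])
    if any(len(v) != dim for v in views):
        raise ValueError("all views must share one dimension")
    return [sum(v[i] for v in views) / len(views) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical shared semantic signal for one table (an assumption).
signal = [1.0, 0.0, 0.0]

# Hypothetical format-specific shifts, chosen so that they cancel on average.
shifts = {
    "csv": [0.0, 0.6, 0.0],
    "tsv": [0.0, -0.6, 0.0],
    "html": [0.0, 0.0, 0.8],
    "markdown": [0.0, 0.0, -0.8],
    "ddl": [0.0, 0.0, 0.0],
}

# Each view is the shared signal plus that format's shift.
views = {
    fmt: [s + d for s, d in zip(signal, shifts[fmt])] for fmt in FORMATS
}

# The centroid over all serializations serves as the canonical representation.
canonical = centroid(list(views.values()))

for fmt in FORMATS:
    print(f"{fmt:9s} cosine to signal: {cosine(views[fmt], signal):.3f}")
print(f"centroid  cosine to signal: {cosine(canonical, signal):.3f}")
```

In this toy setup the shifted views are less similar to the shared signal than the centroid is. With real embeddings the shifts would cancel only partially, and any component shared by every format of a table would remain in the centroid.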