TabReX: Tabular Referenceless eXplainable Evaluation
December 17, 2025
Authors: Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
cs.AI
Abstract
Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a referenceless, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
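The abstract describes the pipeline only at a high level. The sketch below illustrates the core scoring idea under strong simplifying assumptions: tables are canonicalized into (entity, relation, value) triples, exact set intersection stands in for the paper's LLM-guided graph matching, and an F-beta parameter models the sensitivity/specificity trade-off the abstract mentions. All names here (`table_to_triples`, `fidelity_score`, `beta`) are illustrative, not TabReX's actual API.

```python
Triple = tuple[str, str, str]  # (row entity, column relation, cell value)

def table_to_triples(header: list[str], rows: list[list[str]]) -> set[Triple]:
    """Canonicalize a table into knowledge-graph triples.

    Illustrative convention (not necessarily the paper's): the first
    column names the row entity, remaining columns become relations.
    """
    triples: set[Triple] = set()
    for row in rows:
        entity = row[0].strip().lower()
        for col, cell in zip(header[1:], row[1:]):
            triples.add((entity, col.strip().lower(), cell.strip().lower()))
    return triples

def fidelity_score(source: set[Triple],
                   generated: set[Triple],
                   beta: float = 1.0) -> float:
    """F-beta over matched triples.

    beta > 1 weights recall (sensitivity to missing facts);
    beta < 1 weights precision (specificity against hallucinated cells).
    Exact matching is a stand-in for TabReX's LLM-guided alignment.
    """
    if not source or not generated:
        return 0.0
    matched = source & generated
    if not matched:
        return 0.0
    precision = len(matched) / len(generated)
    recall = len(matched) / len(source)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy usage: a source-derived table vs. a generated table with one wrong cell.
src = table_to_triples(["team", "wins"], [["Alpha", "3"], ["Beta", "5"]])
gen = table_to_triples(["team", "wins"], [["Alpha", "3"], ["Beta", "4"]])
print(fidelity_score(src, gen, beta=0.5))  # precision-leaning score
print(fidelity_score(src, gen, beta=2.0))  # recall-leaning score
print(sorted(gen - src))                   # unsupported (hallucinated) cells
print(sorted(src - gen))                   # missing source facts
```

The set differences `gen - src` and `src - gen` correspond loosely to the cell-level error traces the abstract describes: triples in the generated table unsupported by the source, and source facts the generation omitted.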