
TabReX: Tabular Referenceless eXplainable Evaluation

December 17, 2025
Authors: Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
cs.AI

Abstract

Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
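To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of the core idea: a table is decomposed into (row-entity, column, value) triples, and fidelity against triples extracted from the source text is scored with a tunable sensitivity/specificity trade-off. All function names and the exact-match alignment are illustrative assumptions; the actual TabReX system uses canonical knowledge graphs and LLM-guided matching rather than exact triple overlap.

```python
# Illustrative sketch only: exact triple overlap stands in for TabReX's
# LLM-guided graph alignment, and F-beta stands in for its rubric-aware score.

def table_to_triples(table):
    """Flatten a table (header row + data rows) into a set of
    (row_entity, column_name, cell_value) triples, using the first
    column as the row's entity."""
    header, *rows = table
    triples = set()
    for row in rows:
        subject = row[0]
        for col, value in zip(header[1:], row[1:]):
            triples.add((subject, col, value))
    return triples

def fidelity_score(source_triples, table_triples, beta=1.0):
    """F-beta over matched triples: beta > 1 emphasizes sensitivity
    (recovering source facts), beta < 1 emphasizes specificity
    (avoiding unsupported cells)."""
    matched = source_triples & table_triples
    if not matched:
        return 0.0
    precision = len(matched) / len(table_triples)
    recall = len(matched) / len(source_triples)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

The unmatched triples on each side double as a crude analogue of the cell-level error traces the abstract describes: triples only in the table point to hallucinated cells, while triples only in the source point to omissions.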