
TabReX: Tabular Referenceless eXplainable Evaluation

December 17, 2025
Authors: Tejas Anvekar, Juhna Park, Aparna Garimella, Vivek Gupta
cs.AI

Abstract

Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically assess metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis, establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
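To make the abstract's pipeline concrete, here is a minimal, hypothetical sketch of the core idea: a table is decomposed into (row-entity, column, value) triples, and fidelity against triples extracted from the source text is scored with a tunable sensitivity/specificity trade-off. All function names and the exact-match alignment are illustrative assumptions; the actual TabReX system uses canonical knowledge graphs and LLM-guided matching rather than exact triple overlap.

```python
# Illustrative sketch only: exact triple overlap stands in for TabReX's
# LLM-guided graph alignment, and F-beta stands in for its rubric-aware score.

def table_to_triples(table):
    """Flatten a table (header row + data rows) into a set of
    (row_entity, column_name, cell_value) triples, using the first
    column as the row's entity."""
    header, *rows = table
    triples = set()
    for row in rows:
        subject = row[0]
        for col, value in zip(header[1:], row[1:]):
            triples.add((subject, col, value))
    return triples

def fidelity_score(source_triples, table_triples, beta=1.0):
    """F-beta over matched triples: beta > 1 emphasizes sensitivity
    (recovering source facts), beta < 1 emphasizes specificity
    (avoiding unsupported cells)."""
    matched = source_triples & table_triples
    if not matched:
        return 0.0
    precision = len(matched) / len(table_triples)
    recall = len(matched) / len(source_triples)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

The unmatched triples on each side double as a crude analogue of the cell-level error traces the abstract describes: triples only in the table point to hallucinated cells, while triples only in the source point to omissions.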