T2R-bench: A Benchmark for Generating Article-Level Reports from Real-World Industrial Tables

Abstract

Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. The task is hampered by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks cannot adequately assess how well models perform this task in practice. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and covering 19 industry domains and 4 types of industrial tables. Furthermore, we propose a set of evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.