T2R-bench: A Benchmark for Generating Article-Level Reports from Real-World Industrial Tables

Abstract

Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tabular information into reports remains a significant challenge for industrial applications. This task is hampered by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks cannot adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from tables to reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and spanning 19 industry domains and 4 types of industrial tables. Furthermore, we propose a set of evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be made available after acceptance.