T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

August 27, 2025
Authors: Jie Zhang, Changzai Pan, Kaiwen Wei, Sishi Xiong, Yu Zhao, Xiangyu Li, Jiaxin Peng, Xiaoyan Gu, Jian Yang, Wenhan Chang, Zhenhe Wu, Jiang Zhong, Shuangyong Song, Yongxiang Li, Xuelong Li
cs.AI

Abstract

Extensive research has explored the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is hampered by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and covering 19 industry domains and 4 types of industrial tables. Furthermore, we propose a set of evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.