T2R-bench: A Benchmark for Generating Article-Level Reports from Real-World Industrial Tables

Abstract

Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming tabular information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, which covers the key information flow from tables to reports for this task. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and spanning 19 industry domains and 4 types of industrial tables. Furthermore, we propose a set of evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models such as Deepseek-R1 achieve an overall score of only 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.