

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

August 17, 2024
Authors: Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li
cs.AI

Abstract

Recent advancements in Large Language Models (LLMs) have markedly enhanced the interpretation and processing of tabular data, introducing previously unimaginable capabilities. Despite these achievements, LLMs still encounter significant challenges when applied in industrial scenarios, particularly due to the increased complexity of reasoning required with real-world tabular data, underscoring a notable disparity between academic benchmarks and practical applications. To address this discrepancy, we conduct a detailed investigation into the application of tabular data in industrial scenarios and propose a comprehensive and complex benchmark TableBench, including 18 fields within four major categories of table question answering (TableQA) capabilities. Furthermore, we introduce TableLLM, trained on our meticulously constructed training set TableInstruct, achieving comparable performance with GPT-3.5. Massive experiments conducted on TableBench indicate that both open-source and proprietary LLMs still have significant room for improvement to meet real-world demands, where the most advanced model, GPT-4, achieves only a modest score compared to humans.
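For readers unfamiliar with the TableQA setup the abstract refers to, the sketch below illustrates one common evaluation pattern: serialize a table into the prompt, pose the question, and score the model's answer with exact match. The item schema, the `ask_llm` callable, and the exact-match metric are illustrative assumptions only, not TableBench's actual data format or evaluation protocol.

```python
# Hypothetical TableQA evaluation loop (table + question -> model answer -> exact match).
# The item fields, `ask_llm`, and the metric are assumptions for illustration,
# not the schema or protocol used by TableBench.

from typing import Callable, Dict, List


def table_to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a table as markdown so it can be embedded in an LLM prompt."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def evaluate(items: List[Dict], ask_llm: Callable[[str], str]) -> float:
    """Return exact-match accuracy of `ask_llm` over a list of TableQA items."""
    correct = 0
    for item in items:
        prompt = (
            "Answer the question using only the table below.\n\n"
            f"{table_to_markdown(item['header'], item['rows'])}\n\n"
            f"Question: {item['question']}\nAnswer:"
        )
        prediction = ask_llm(prompt).strip().lower()
        correct += prediction == item["answer"].strip().lower()
    return correct / len(items)


if __name__ == "__main__":
    # Toy item; real benchmark questions typically require multi-step
    # numerical, fact-checking, or analytical reasoning over larger tables.
    items = [{
        "header": ["Year", "Revenue (M$)"],
        "rows": [["2022", "120"], ["2023", "150"]],
        "question": "What is the total revenue across both years?",
        "answer": "270",
    }]
    # Stub model that happens to answer the toy question correctly.
    print(evaluate(items, ask_llm=lambda prompt: "270"))
```

The free-form answer matching and prompt layout above are deliberate simplifications; the paper reports that even strong proprietary models fall well short of human performance on its more complex reasoning categories.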

