TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
August 17, 2024
Authors: Xianjie Wu, Jian Yang, Linzheng Chai, Ge Zhang, Jiaheng Liu, Xinrun Du, Di Liang, Daixin Shu, Xianfu Cheng, Tianzhen Sun, Guanglin Niu, Tongliang Li, Zhoujun Li
cs.AI
Abstract
Recent advancements in Large Language Models (LLMs) have markedly enhanced
the interpretation and processing of tabular data, introducing previously
unimaginable capabilities. Despite these achievements, LLMs still encounter
significant challenges when applied in industrial scenarios, particularly due
to the increased complexity of reasoning required with real-world tabular data,
underscoring a notable disparity between academic benchmarks and practical
applications. To address this discrepancy, we conduct a detailed investigation
into the application of tabular data in industrial scenarios and propose a
comprehensive and complex benchmark TableBench, including 18 fields within four
major categories of table question answering (TableQA) capabilities.
Furthermore, we introduce TableLLM, trained on our meticulously constructed
training set TableInstruct, achieving comparable performance with GPT-3.5.
Extensive experiments conducted on TableBench indicate that both open-source and
proprietary LLMs still have significant room for improvement to meet real-world
demands, where the most advanced model, GPT-4, achieves only a modest score
compared to humans.
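To make the task concrete, the sketch below shows what a single table question answering (TableQA) instance of the kind TableBench evaluates might look like: a small table, a natural-language question requiring numerical reasoning over multiple columns, and the derived answer. The table contents, field names, and `answer` helper are illustrative assumptions, not the benchmark's actual schema or data.

```python
# Hypothetical TableQA instance (illustrative only; not TableBench's schema).
table = {
    "columns": ["City", "Population (M)", "GDP ($B)"],
    "rows": [
        ["Tokyo", 37.4, 1600],
        ["Delhi", 31.2, 290],
        ["Shanghai", 27.8, 680],
    ],
}

question = "Which city has the highest GDP per capita?"

def answer(table):
    # Numerical-reasoning step: compute GDP / population for each row
    # and return the city with the largest ratio.
    best = max(table["rows"], key=lambda row: row[2] / row[1])
    return best[0]

print(answer(table))  # -> Tokyo (1600/37.4 > 680/27.8 > 290/31.2)
```

Answering such questions requires combining values across columns rather than looking up a single cell, which is the kind of multi-step reasoning the abstract identifies as the gap between academic benchmarks and industrial tabular data.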