

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

June 13, 2024
Authors: Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi
cs.AI

Abstract

Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of the problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.
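
As a brief illustration (not part of the paper itself), the released benchmark can presumably be loaded with the Hugging Face datasets library. The configuration and split names are not stated in the abstract, so this Python sketch discovers them at runtime rather than assuming them:

    from datasets import get_dataset_config_names, load_dataset

    # List the benchmark's configurations instead of hard-coding assumed names.
    configs = get_dataset_config_names("baharef/ToT")
    print("Available configurations:", configs)

    # Load one configuration and inspect its splits and features
    # before building an evaluation loop around it.
    ds = load_dataset("baharef/ToT", configs[0])
    print(ds)

Inspecting the loaded splits and features first avoids assumptions about field names, which the abstract does not specify.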
