Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
June 13, 2024
作者: Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi
cs.AI
Abstract
Large language models (LLMs) have showcased remarkable reasoning
capabilities, yet they remain susceptible to errors, particularly in temporal
reasoning tasks involving complex temporal logic. Existing research has
explored LLM performance on temporal reasoning using diverse datasets and
benchmarks. However, these studies often rely on real-world data that LLMs may
have encountered during pre-training or employ anonymization techniques that
can inadvertently introduce factual inconsistencies. In this work, we address
these limitations by introducing novel synthetic datasets specifically designed
to assess LLM temporal reasoning abilities in various scenarios. The diversity
of question types across these datasets enables systematic investigation into
the impact of the problem structure, size, question type, fact order, and other
factors on LLM performance. Our findings provide valuable insights into the
strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster
further research in this area, we are open-sourcing the datasets and evaluation
framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.
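To make the idea of a synthetic temporal-reasoning item concrete, here is a toy sketch of how such a question might be generated. This is an illustration only, not the paper's actual generator: the names, the "CEO tenure" scenario, and the parameters (`num_facts`, `shuffle_facts`) are invented for this example. It shows how fact order can be varied independently of the underlying timeline, one of the factors the benchmark studies.

```python
import random

def make_item(num_facts=3, shuffle_facts=False, seed=0):
    """Toy illustration of a synthetic temporal-reasoning item.

    Builds a chain of contiguous tenures, then asks who held the
    role in a randomly chosen year. Entity names and the scenario
    are hypothetical, not taken from the ToT dataset.
    """
    rng = random.Random(seed)
    people = ["Alice", "Bob", "Carol", "Dan", "Eve"]
    rng.shuffle(people)

    start_year = 1990
    facts, spans = [], {}
    year = start_year
    for person in people[:num_facts]:
        end = year + rng.randint(1, 5)
        facts.append(f"{person} was CEO from {year} to {end}.")
        spans[person] = (year, end)  # half-open interval [year, end)
        year = end

    if shuffle_facts:
        rng.shuffle(facts)  # vary fact order without changing the timeline

    # Pick a year inside the covered range; exactly one tenure contains it.
    query_year = rng.randint(start_year, year - 1)
    answer = next(p for p, (s, e) in spans.items() if s <= query_year < e)

    return {
        "context": " ".join(facts),
        "question": f"Who was CEO in {query_year}?",
        "answer": answer,
    }
```

Because the world is fully synthetic, the gold answer is computed from the same structure that generated the facts, so contamination from pre-training data and anonymization inconsistencies are avoided by construction.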