Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
June 13, 2024
作者: Bahare Fatemi, Mehran Kazemi, Anton Tsitsulin, Karishma Malkan, Jinyeong Yim, John Palowitch, Sungyong Seo, Jonathan Halcrow, Bryan Perozzi
cs.AI
Abstract
Large language models (LLMs) have showcased remarkable reasoning
capabilities, yet they remain susceptible to errors, particularly in temporal
reasoning tasks involving complex temporal logic. Existing research has
explored LLM performance on temporal reasoning using diverse datasets and
benchmarks. However, these studies often rely on real-world data that LLMs may
have encountered during pre-training or employ anonymization techniques that
can inadvertently introduce factual inconsistencies. In this work, we address
these limitations by introducing novel synthetic datasets specifically designed
to assess LLM temporal reasoning abilities in various scenarios. The diversity
of question types across these datasets enables systematic investigation into
the impact of the problem structure, size, question type, fact order, and other
factors on LLM performance. Our findings provide valuable insights into the
strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster
further research in this area, we are open-sourcing the datasets and evaluation
framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.
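To make the idea of a synthetic temporal-reasoning item concrete, here is a toy sketch of how such a question might be generated. This is an illustration only, not the paper's actual generator: the names, the "CEO tenure" scenario, and the parameters (`num_facts`, `shuffle_facts`) are invented for this example. It shows how fact order can be varied independently of the underlying timeline, one of the factors the benchmark studies.

```python
import random

def make_item(num_facts=3, shuffle_facts=False, seed=0):
    """Toy illustration of a synthetic temporal-reasoning item.

    Builds a chain of contiguous tenures, then asks who held the
    role in a randomly chosen year. Entity names and the scenario
    are hypothetical, not taken from the ToT dataset.
    """
    rng = random.Random(seed)
    people = ["Alice", "Bob", "Carol", "Dan", "Eve"]
    rng.shuffle(people)

    start_year = 1990
    facts, spans = [], {}
    year = start_year
    for person in people[:num_facts]:
        end = year + rng.randint(1, 5)
        facts.append(f"{person} was CEO from {year} to {end}.")
        spans[person] = (year, end)  # half-open interval [year, end)
        year = end

    if shuffle_facts:
        rng.shuffle(facts)  # vary fact order without changing the timeline

    # Pick a year inside the covered range; exactly one tenure contains it.
    query_year = rng.randint(start_year, year - 1)
    answer = next(p for p, (s, e) in spans.items() if s <= query_year < e)

    return {
        "context": " ".join(facts),
        "question": f"Who was CEO in {query_year}?",
        "answer": answer,
    }
```

Because the world is fully synthetic, the gold answer is computed from the same structure that generated the facts, so contamination from pre-training data and anonymization inconsistencies are avoided by construction.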