Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Abstract

Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors, particularly in temporal reasoning tasks involving complex temporal logic. Existing research has explored LLM performance on temporal reasoning using diverse datasets and benchmarks. However, these studies often rely on real-world data that LLMs may have encountered during pre-training, or employ anonymization techniques that can inadvertently introduce factual inconsistencies. In this work, we address these limitations by introducing novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios. The diversity of question types across these datasets enables systematic investigation into the impact of problem structure, size, question type, fact order, and other factors on LLM performance. Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks. To foster further research in this area, we are open-sourcing the datasets and evaluation framework used in our experiments: https://huggingface.co/datasets/baharef/ToT.

