What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction), five languages (English, German, Chinese, Arabic, and Hausa), and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples, built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, whereas high-resource settings are often robust to digit-level splitting. Beyond tokenisation, a crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at https://github.com/gagan3012/mtb
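
To make the notion of date fragmentation concrete, the following is a minimal sketch of a toy fragmentation score. It is not the mDFR defined in this work (which is additionally calibrated with human severity ratings); it only counts how many Year/Month/Day components fail to survive tokenisation as a single token. The function names (`component_spans`, `token_spans`, `toy_fragmentation`) and the hand-written token lists are hypothetical and stand in for the output of a real tokeniser.

```python
import re


def component_spans(date_text):
    """Character spans of each alphanumeric run (e.g. year, month, day)."""
    return [m.span() for m in re.finditer(r"\w+", date_text)]


def token_spans(tokens):
    """Character spans of tokens, assuming they concatenate to the input."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def toy_fragmentation(date_text, tokens):
    """Fraction of date components NOT covered by exactly one token.

    Illustrative assumption only: 0.0 means every component is a single
    token; 1.0 means every component is split or merged with a neighbour.
    """
    if "".join(tokens) != date_text:
        raise ValueError("tokens must concatenate to date_text")
    comps = component_spans(date_text)
    if not comps:
        return 0.0
    tspans = set(token_spans(tokens))
    broken = sum(1 for span in comps if span not in tspans)
    return broken / len(comps)


# Hand-written tokenisations of the same date (hypothetical examples).
clean = ["2024", "-", "03", "-", "15"]         # components kept intact
split = ["20", "24", "-0", "3", "-", "15"]     # year and month broken up

print(toy_fragmentation("2024-03-15", clean))
print(toy_fragmentation("2024-03-15", split))
```

In the second tokenisation, the year is split in two and the month is fused with its separator, so two of the three components are counted as broken. A score of this kind treats all breakages equally, which is precisely the simplification that a severity-calibrated measure is meant to avoid.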

