What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples, built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, whereas high-resource settings are often robust to digit-level splitting. Beyond tokenisation, a crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
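
To make the idea of "controlled date-format variants" concrete, the following is a minimal sketch of how a single date might be rendered in several surface forms. It is an illustration only, not the benchmark's actual pipeline: the format set `FORMATS` and the helper `expand_date_variants` are hypothetical names chosen here, and the abstract does not list the real variants, languages, or calendar conversions used.

```python
from datetime import date

# Hypothetical format set (an assumption for illustration);
# the benchmark's real date-format variants are not given in the abstract.
FORMATS = {
    "iso": "%Y-%m-%d",
    "slash_dmy": "%d/%m/%Y",
    "slash_mdy": "%m/%d/%Y",
    "compact": "%Y%m%d",
}


def expand_date_variants(d: date) -> dict[str, str]:
    """Render one Gregorian date in every format listed in FORMATS."""
    return {name: d.strftime(pattern) for name, pattern in FORMATS.items()}


if __name__ == "__main__":
    # Each variant encodes the same date, so only the surface form differs.
    for name, text in expand_date_variants(date(2024, 3, 7)).items():
        print(f"{name}: {text}")
```

Because every variant encodes the same underlying date, any difference in model accuracy across variants can be attributed to the surface form, and hence to how that form is tokenised, rather than to the reasoning task itself.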
