What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction), five languages (English, German, Chinese, Arabic, and Hausa), and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples, built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated against human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, whereas high-resource settings are often robust to digit-level splitting. Beyond tokenisation, a crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
