What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction), five languages (English, German, Chinese, Arabic, and Hausa), and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, whereas high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
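
The abstract does not spell out how mDFR is computed, so the sketch below is only a toy illustration of what "fragmentation" of a date by a tokeniser can mean. It is not the paper's metric. The function name `toy_fragmentation`, the choice to treat each run of digits as one date component, and the formula (one minus components divided by the number of token pieces covering them) are all assumptions made for illustration.

```python
import re


def toy_fragmentation(date_text, tokens):
    """Toy score in [0, 1): 0.0 means every digit run sits inside a single token.

    Hypothetical helper for illustration only; not the mDFR defined in the paper.
    `tokens` must concatenate exactly to `date_text`.
    """
    if "".join(tokens) != date_text:
        raise ValueError("tokens must concatenate to date_text")

    # Character span [start, end) of every non-empty token.
    spans, pos = [], 0
    for tok in tokens:
        if tok:
            spans.append((pos, pos + len(tok)))
        pos += len(tok)

    # Assumption: each run of digits (year, month, day) is one date component.
    components = [m.span() for m in re.finditer(r"\d+", date_text)]
    if not components:
        return 0.0

    # Count how many tokens overlap each component, summed over components.
    pieces = sum(
        sum(1 for start, end in spans if start < c_end and end > c_start)
        for c_start, c_end in components
    )
    return 1.0 - len(components) / pieces


# Components kept whole versus split into two-digit and one-digit pieces.
whole = toy_fragmentation("2024-03-15", ["2024", "-", "03", "-", "15"])
split = toy_fragmentation("2024-03-15", ["20", "24", "-", "0", "3", "-", "1", "5"])
print(whole, split)
```

Under this toy definition, a tokeniser that keeps year, month, and day intact scores 0.0, while one that splits each component in two scores 0.5. The sketch ignores cases the abstract highlights, such as a single token that swallows the separators along with the digits, and dates written with non-digit numerals.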