Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
May 22, 2025
Authors: Gagan Bhatia, Maxime Peyrard, Wei Zhao
cs.AI
Abstract
Modern BPE tokenizers often split calendar dates into meaningless fragments,
e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring
the inherent structure needed for robust temporal reasoning. In this work, we
(1) introduce a simple yet interpretable metric, termed date fragmentation
ratio, that measures how faithfully a tokenizer preserves multi-digit date
components; (2) release DateAugBench, a suite of 6500 examples spanning three
temporal reasoning tasks: context-based date resolution, format-invariance
puzzles, and date arithmetic across historical, contemporary, and future
regimes; and (3) through layer-wise probing and causal attention-hop analyses,
uncover an emergent date-abstraction mechanism whereby large language models
stitch together the fragments of month, day, and year components for temporal
reasoning. Our experiments show that excessive fragmentation correlates with
accuracy drops of up to 10 points on uncommon dates like historical and
futuristic dates. Further, we find that the larger the model, the faster the
emergent date abstraction that heals date fragments is accomplished. Lastly, we
observe a reasoning path that LLMs follow to assemble date fragments, typically
differing from human interpretation (year → month → day).
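The abstract names a date fragmentation ratio but does not give its formula. Below is a minimal sketch of one plausible way to compute such a ratio with the Hugging Face `transformers` tokenizer API, assuming the metric measures the fraction of tokens that straddle a date-component boundary (year, month, day). The function name `date_fragmentation_ratio`, the component spans, and the choice of the GPT-2 tokenizer are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: one plausible "date fragmentation ratio" for a BPE tokenizer.
# Assumption: the ratio is 0 when every token stays inside a single date
# component (YYYY, MM, DD) and grows as tokens cut across component
# boundaries. The paper's exact definition may differ.
from transformers import AutoTokenizer  # pip install transformers


def date_fragmentation_ratio(tokenizer, date, component_spans):
    """date: e.g. '20250312'; component_spans: [(0, 4), (4, 6), (6, 8)]."""
    pieces = tokenizer.tokenize(date)

    # Map each token back to its character span inside the date string.
    spans, pos = [], 0
    for piece in pieces:
        text = tokenizer.convert_tokens_to_string([piece]).strip()
        spans.append((pos, pos + len(text)))
        pos += len(text)

    # A token "crosses" a boundary if it covers characters from more
    # than one date component.
    def crosses(span):
        covered = {i for i, (s, e) in enumerate(component_spans)
                   if span[0] < e and span[1] > s}
        return len(covered) > 1

    crossing = sum(crosses(s) for s in spans)
    return crossing / len(spans) if spans else 0.0


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer choice
    print(tok.tokenize("20250312"))  # e.g. ['202', '503', '12']
    print(date_fragmentation_ratio(tok, "20250312", [(0, 4), (4, 6), (6, 8)]))
```

Under this toy definition, a split like 202 | 503 | 12 yields a nonzero ratio because the middle token spans the year–month boundary, whereas a split like 2025 | 03 | 12 yields 0, matching the intuition that faithful component-level tokenization should score lowest.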