

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

May 22, 2025
作者: Gagan Bhatia, Maxime Peyrard, Wei Zhao
cs.AI

Abstract

Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).
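The fragmentation effect described above is easy to reproduce. The sketch below is a hypothetical illustration, not the paper's implementation: the exact definition of the date fragmentation ratio is not given in the abstract, so one plausible formalization is assumed here — the share of date components (year, month, day) that do not survive tokenization as a single intact token. The greedy longest-match `fragment` helper and the toy vocabulary are likewise illustrative stand-ins for a real BPE tokenizer.

```python
def fragment(date_str, vocab):
    """Greedy longest-match segmentation, mimicking how a fixed
    BPE-style vocabulary might split a digit string.
    Falls back to single characters when no vocab entry matches."""
    pieces, i = [], 0
    while i < len(date_str):
        for j in range(len(date_str), i, -1):
            if date_str[i:j] in vocab or j == i + 1:
                pieces.append(date_str[i:j])
                i = j
                break
    return pieces

def fragmentation_ratio(components, pieces):
    """Assumed metric: fraction of date components NOT recovered
    intact as a single token. 0 = fully preserved, 1 = every
    component shattered."""
    preserved = sum(1 for c in components if c in pieces)
    return 1 - preserved / len(components)

# Toy vocabulary lacking "2025", reproducing the abstract's
# 20250312 -> 202, 503, 12 split.
vocab = {"202", "503", "12", "0312"}
pieces = fragment("20250312", vocab)
print(pieces)  # ['202', '503', '12']
# Only the day "12" survives intact, so the ratio is 2/3.
print(fragmentation_ratio(["2025", "03", "12"], pieces))
```

A tokenizer whose vocabulary contained "2025", "03", and "12" as whole tokens would score 0 on this date, matching the intuition that lower fragmentation should support more robust temporal reasoning.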

