ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

Abstract

Large language models (LLMs) have significantly impacted many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the accumulative nature of knowledge, often relying on a single timestamp. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., scientific discoveries, amended laws) and knowledge that remains constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating and updating LLMs' non-parametric chronological knowledge. Our evaluation shows: (1) The ability to elicit temporal knowledge varies depending on the data format the model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt, an in-depth prompting method that elicits chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that our framework successfully updates the overall knowledge across the entire timeline in both the biomedical domain (+11.9%) and the general domain (+2.8%), demonstrating its effectiveness in refining temporal knowledge. This non-parametric approach also enables knowledge updates not only in open-source models but also in proprietary LLMs, ensuring comprehensive applicability across model types. We perform a comprehensive analysis based on the temporal characteristics of ChroKnowPrompt and validate the potential of various models to elicit intrinsic temporal knowledge through our method.
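The idea of traversing the surrounding time spans can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the prompt format, the traversal order, or how answers are combined, so the `ask` callable, the nearest-first ordering, and the majority-vote fallback below are all assumptions made purely for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical interface (an assumption, not from the paper):
# (question, year, context gathered so far) -> answer, or None if unknown.
AskFn = Callable[[str, int, Dict[int, str]], Optional[str]]


def surrounding_years(target: int, span: int) -> List[int]:
    """Years around `target`, nearest first: t-1, t+1, t-2, t+2, ..."""
    years: List[int] = []
    for offset in range(1, span + 1):
        years.extend([target - offset, target + offset])
    return years


def elicit_by_traversal(
    question: str, target: int, span: int, ask: AskFn
) -> Optional[str]:
    """Query neighboring years step by step, then answer for `target`."""
    context: Dict[int, str] = {}
    for year in surrounding_years(target, span):
        # Each step sees the answers collected from earlier steps.
        answer = ask(question, year, dict(context))
        if answer is not None:
            context[year] = answer

    final = ask(question, target, dict(context))
    if final is not None:
        return final
    if not context:
        return None
    # Illustrative fallback: the most frequent answer among the neighbors.
    return Counter(context.values()).most_common(1)[0][0]


def toy_ask(question: str, year: int, context: Dict[int, str]) -> Optional[str]:
    """Stand-in for a model call; it knows the answer for only some years."""
    facts = {2019: "A", 2020: "A", 2022: "B"}
    return facts.get(year)


if __name__ == "__main__":
    print(elicit_by_traversal("Who holds position X?", 2021, 2, toy_ask))
```

In a real setting `ask` would wrap a call to an open-source or proprietary model. Because the sketch only needs query access, it is consistent with the non-parametric framing in the abstract, where no model weights are modified.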
