

Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

March 8, 2026
作者: Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo
cs.AI

Abstract

LLMs operating in dynamic real-world contexts often encounter knowledge that evolves continuously or emerges incrementally. To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to Continual Knowledge Streams (OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge. Specifically, the benchmark is structured as a sequence of fine-grained context chunks where facts change dynamically across time intervals. OAKS comprises two datasets, OAKS-BABI and OAKS-Novel, in which individual facts evolve multiple times across context chunks. These datasets include dense annotations to measure whether models track changes accurately. Evaluating 14 models with varied inference approaches, we observe significant limitations in current methodologies. Both state-of-the-art models and agentic memory systems fail to adapt robustly on OAKS, demonstrating delays in state tracking and susceptibility to distraction within streaming environments.
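The evaluation setup the abstract describes (a stream of context chunks in which individual facts change value over time, with the model scored on tracking the latest state) can be sketched as follows. This is a minimal illustrative sketch, not the benchmark's actual code: the `Chunk`, `StreamState`, and `evaluate` names and the per-chunk query dictionary are all assumptions made for exposition.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str          # the context chunk shown to the model
    updates: dict      # fact_id -> new value asserted in this chunk

@dataclass
class StreamState:
    # Ground-truth state: fact_id -> current (latest) value.
    facts: dict = field(default_factory=dict)

    def apply(self, chunk: Chunk) -> None:
        # Later chunks overwrite earlier values, so facts can
        # evolve multiple times across the stream.
        self.facts.update(chunk.updates)

def evaluate(stream, queries, model_answer):
    """Feed chunks one by one; after each chunk, query the model on the
    facts listed for that position and score it against the ground-truth
    state at that point in the stream."""
    state = StreamState()
    correct = total = 0
    for i, chunk in enumerate(stream):
        state.apply(chunk)
        for fact_id in queries.get(i, []):
            total += 1
            if model_answer(fact_id) == state.facts.get(fact_id):
                correct += 1
    return correct / total if total else 0.0
```

A model that latches onto a stale value (e.g. always answering with a fact's first value) is penalized on every query issued after that fact changes, which is how delayed state tracking surfaces as a score gap in this kind of setup.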