A Controllable Examination for Long-Context Language Models
June 3, 2025
Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov
cs.AI
Abstract
Existing frameworks for evaluating long-context language models (LCLMs) can be
broadly categorized into real-world and synthetic tasks. Despite their utility,
both approaches are accompanied by certain intrinsic limitations. Real-world
tasks are too complex to interpret or characterize and are susceptible to data
contamination. In contrast, synthetic tasks often adopt the
needle-in-the-haystack (NIAH) format, wherein a lack of coherence between the
"needle" and the "haystack" compromises their validity as proxies for realistic
applications. In response to these challenges, we posit that an ideal
long-context evaluation framework should be characterized by three essential
features: seamless context, controllable setting, and
sound evaluation. This study introduces LongBioBench, a
novel benchmark that utilizes artificially generated biographies as a
controlled environment for assessing LCLMs across dimensions of
understanding, reasoning, and trustworthiness.
Our experimental evaluation, which includes 18 LCLMs in total,
demonstrates that most models still exhibit deficiencies in semantic
understanding and elementary reasoning over retrieved results and are less
trustworthy as context length increases. Our further analysis indicates that some
design choices employed by existing synthetic benchmarks, such as contextual
non-coherence, numerical needles, and the absence of distractors, render them
fragile as tests of models' long-context capabilities. Moreover, we also
reveal that long-context continual pretraining primarily adjusts RoPE embeddings
to accommodate extended context lengths. To sum up, compared to previous
synthetic benchmarks, LongBioBench achieves a better trade-off between
mirroring authentic language tasks and maintaining controllability, and is
highly interpretable and configurable.
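
To make the controlled setup concrete, below is a minimal, hypothetical sketch of how a biography-style long-context probe could be assembled: many generated biographies form a seamless context, one of them holds the answer, and the surrounding biographies act as natural distractors. The name pools, attribute fields, templates, and function names (e.g. `build_item`) are illustrative assumptions, not the actual LongBioBench implementation.

```python
# Hypothetical sketch: build one biography-based long-context QA item.
# Name pools, attribute fields, and templates are illustrative assumptions.
import random

FIRST = ["Alice", "Bashir", "Chen", "Dmitri", "Esi", "Farah", "Goran", "Hana", "Ivo", "Jun"]
LAST = ["Okafor", "Larsen", "Tanaka", "Moreau", "Silva", "Novak", "Iqbal", "Berg", "Costa", "Reyes"]
CITIES = ["Lisbon", "Nairobi", "Osaka", "Bogota", "Tallinn", "Quito"]
JOBS = ["architect", "biologist", "chef", "pianist", "nurse", "geologist"]

def render_bio(bio: dict) -> str:
    # Render attributes as fluent prose so every "needle" blends into the context.
    return (f"{bio['name']} was born in {bio['birth_year']} in {bio['city']} "
            f"and works as a {bio['job']}.")

def build_item(num_bios: int = 50, seed: int = 0) -> dict:
    rng = random.Random(seed)
    # Unique names keep the question unambiguous; the other bios act as distractors.
    names = rng.sample([f"{f} {l}" for f in FIRST for l in LAST], num_bios)
    bios = [{"name": n,
             "birth_year": rng.randint(1950, 2000),
             "city": rng.choice(CITIES),
             "job": rng.choice(JOBS)} for n in names]
    target = rng.choice(bios)
    return {
        "context": " ".join(render_bio(b) for b in bios),  # seamless long context
        "question": f"In which city was {target['name']} born?",
        "answer": target["city"],
    }

if __name__ == "__main__":
    item = build_item()
    print(item["question"], "->", item["answer"])
```

In such a setup, context length and difficulty can be controlled by varying `num_bios`, the attribute being queried, or the number of distractor biographies, which is the kind of configurability the abstract attributes to the benchmark.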