

A Controllable Examination for Long-Context Language Models

June 3, 2025
Authors: Yijun Yang, Zeyu Huang, Wenhao Zhu, Zihan Qiu, Fei Yuan, Jeff Z. Pan, Ivan Titov
cs.AI

Abstract

Existing frameworks for evaluating long-context language models (LCLMs) can be broadly categorized into real-world and synthetic tasks. Despite their utility, both approaches have intrinsic limitations. Real-world tasks are too complex to interpret or characterize and are susceptible to data contamination. In contrast, synthetic tasks often adopt the needle-in-the-haystack (NIAH) format, in which a lack of coherence between the "needle" and the "haystack" compromises their validity as proxies for realistic applications. In response to these challenges, we posit that an ideal long-context evaluation framework should have three essential features: seamless context, controllable setting, and sound evaluation. This study introduces LongBioBench, a novel benchmark that uses artificially generated biographies as a controlled environment for assessing LCLMs along the dimensions of understanding, reasoning, and trustworthiness. Our experimental evaluation, covering 18 LCLMs in total, demonstrates that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results, and become less trustworthy as context length increases. Further analysis indicates that certain design choices in existing synthetic benchmarks, such as contextual non-coherence, numerical needles, and the absence of distractors, leave them vulnerable as tests of models' long-context capabilities. Moreover, we reveal that long-context continual pretraining primarily adjusts the RoPE embeddings to accommodate extended context lengths. In summary, compared to previous synthetic benchmarks, LongBioBench achieves a better trade-off between mirroring authentic language tasks and maintaining controllability, and it is highly interpretable and configurable.
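To make the benchmark design concrete, the sketch below shows one way a controlled, biography-based long-context QA item of the kind the abstract describes could be constructed: synthetic biographies are rendered as coherent prose (a seamless context), every non-target biography acts as a distractor, and the haystack size is directly configurable. This is a minimal illustration only; the attribute pools and the helpers `render_bio` and `build_item` are hypothetical and are not taken from the LongBioBench release.

```python
import random

# Tiny attribute pools for illustration; a real benchmark would use far larger pools.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dana", "Emre", "Farah"]
LAST_NAMES = ["Okafor", "Silva", "Novak", "Ito", "Haddad", "Kovacs"]
CITIES = ["Lisbon", "Nairobi", "Osaka", "Tallinn", "Quito", "Hanoi"]
JOBS = ["botanist", "archivist", "glassblower", "cartographer"]


def render_bio(person: dict) -> str:
    """Render one person's attributes as coherent prose, so the 'needle'
    blends into the surrounding context instead of standing out."""
    return (
        f"{person['name']} was born in {person['birth_year']} in {person['city']} "
        f"and later became a well-known {person['job']}."
    )


def build_item(num_people: int, seed: int = 0) -> dict:
    """Build one long-context QA item: a haystack of synthetic biographies,
    a question about a single target person, and the gold answer."""
    rng = random.Random(seed)
    # Unique full names guarantee each question has exactly one gold answer.
    names = [f"{f} {l}" for f in FIRST_NAMES for l in LAST_NAMES]
    rng.shuffle(names)
    people = [
        {
            "name": name,
            "birth_year": rng.randint(1940, 2000),
            "city": rng.choice(CITIES),
            "job": rng.choice(JOBS),
        }
        for name in names[:num_people]
    ]
    # Every non-target biography shares the same attribute space, acting as a
    # distractor: the model must bind the queried value to the right name.
    target = rng.choice(people)
    context = " ".join(render_bio(p) for p in people)
    question = f"In which city was {target['name']} born?"
    return {"context": context, "question": question, "answer": target["city"]}


if __name__ == "__main__":
    # Scaling num_people (and the attribute pools) controls the context length.
    item = build_item(num_people=30, seed=42)
    print(item["question"], "->", item["answer"])
    print("context length (chars):", len(item["context"]))
```

Because every attribute of every biography is known at generation time, such a setup remains fully interpretable and configurable: distractor density, needle position, and context length can all be varied independently, which is the controllability the paper argues NIAH-style benchmarks lack.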