RULER: What's the Real Context Size of Your Long-Context Language Models?
April 9, 2024
Authors: Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Boris Ginsburg
cs.AI
Abstract
The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve
a piece of information (the "needle") from long distractor texts (the
"haystack"), has been widely adopted to evaluate long-context language models
(LMs). However, this simple retrieval-based test is indicative of only a
superficial form of long-context understanding. To provide a more comprehensive
evaluation of long-context LMs, we create a new synthetic benchmark RULER with
flexible configurations for customized sequence length and task complexity.
RULER expands upon the vanilla NIAH test to encompass variations with diverse
types and quantities of needles. Moreover, RULER introduces new task categories,
multi-hop tracing and aggregation, to test behaviors beyond searching from
context. We evaluate ten long-context LMs with 13 representative tasks in
RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all
models exhibit large performance drops as the context length increases. While
these models all claim context sizes of 32K tokens or greater, only four models
(GPT-4, Command-R, Yi-34B, and Mixtral) can maintain satisfactory performance
at the length of 32K. Our analysis of Yi-34B, which supports a context length of
200K, reveals large room for improvement as we increase input length and task
complexity. We open source RULER to spur comprehensive evaluation of
long-context LMs.
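To make the vanilla NIAH setup concrete, the sketch below shows one way such a synthetic example can be constructed: a key-value "needle" sentence is buried at a random depth inside repeated filler "haystack" text, and the model is queried for the value. This is a minimal, hypothetical illustration; the function name `make_niah_example`, the filler sentence, and the key/value format are assumptions for demonstration and are not RULER's actual generation code or task configuration.

```python
import random

def make_niah_example(context_len_words: int = 4000, seed: int = 0) -> dict:
    """Build a toy needle-in-a-haystack prompt (illustrative sketch only)."""
    rng = random.Random(seed)

    # Distractor "haystack": a filler sentence repeated to roughly the target length.
    filler_sentence = "The grass is green and the sky is blue. "
    n_repeats = context_len_words // len(filler_sentence.split())
    sentences = [filler_sentence] * n_repeats

    # The "needle": a key-value fact the model must later retrieve.
    key = f"magic-number-{rng.randint(1000, 9999)}"
    value = str(rng.randint(100000, 999999))
    needle = f"The special value for {key} is {value}. "

    # Insert the needle at a random position (the needle's "depth" in the context).
    insert_at = rng.randint(0, len(sentences))
    sentences.insert(insert_at, needle)

    haystack = "".join(sentences)
    question = f"What is the special value for {key}? Answer with the number only."
    return {"prompt": haystack + "\n" + question, "answer": value}

if __name__ == "__main__":
    example = make_niah_example(context_len_words=2000, seed=42)
    print(example["prompt"][-200:])      # tail of the prompt, including the question
    print("expected answer:", example["answer"])
```

RULER's extensions can be thought of as variations on this template, e.g. inserting multiple or differently typed needles, or requiring the model to aggregate over many inserted facts rather than retrieve a single one, while the sequence length and task complexity stay configurable.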