RULER: What's the Real Context Size of Your Long-Context Language Models?

April 9, 2024
Authors: Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Boris Ginsburg
cs.AI

Abstract

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark, RULER, with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces two new task categories, multi-hop tracing and aggregation, to test behaviors beyond searching from context. We evaluate ten long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only four models (GPT-4, Command-R, Yi-34B, and Mixtral) can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
