RULER: What's the Real Context Size of Your Long-Context Language Models?

Abstract

The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context language models (LMs). However, this simple retrieval-based test indicates only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark, RULER, with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories, multi-hop tracing and aggregation, to test behaviors beyond searching from context. We evaluate ten long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only four models (GPT-4, Command-R, Yi-34B, and Mixtral) can maintain satisfactory performance at a length of 32K. Our analysis of Yi-34B, which supports a context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.
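To make the vanilla NIAH setup described in the abstract concrete, here is a minimal Python sketch of how such a prompt could be assembled and scored. It is not RULER's implementation: the function names, the filler sentences, the "magic number" needle, and the substring-based scoring are all assumptions chosen for illustration.

```python
import random

# Hypothetical filler text; a real haystack would use long natural documents.
FILLER = [
    "The grass is green.",
    "The sky is blue.",
    "The sun is yellow.",
    "Here we go.",
]

# Hypothetical needle: a key-value fact the model must retrieve.
KEY, VALUE = "magic number", "7481"
NEEDLE = f"The {KEY} is {VALUE}."
QUESTION = f"What is the {KEY} mentioned in the text above?"


def build_niah_prompt(filler_sentences, needle, num_sentences, depth, seed=0):
    """Return a haystack of filler sentences with one needle inserted.

    depth is a fraction in [0, 1]: 0 places the needle first, 1 places it last.
    num_sentences controls the haystack length (a stand-in for context length).
    """
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(num_sentences)]
    position = round(depth * num_sentences)
    haystack.insert(position, needle)
    return " ".join(haystack)


def is_retrieved(model_output, expected_value):
    """Score a response with a simple substring match on the needle's value."""
    return expected_value in model_output


# Build one test case with the needle in the middle of a 200-sentence haystack.
prompt = build_niah_prompt(FILLER, NEEDLE, num_sentences=200, depth=0.5)
full_input = f"{prompt}\n\n{QUESTION}"
```

Sweeping `num_sentences` and `depth` over a grid gives the familiar length-by-position view of retrieval accuracy. The variations the abstract mentions (different needle types and quantities) would correspond to changing `NEEDLE` or inserting several needles, which this sketch does not attempt.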

