LongIns: A Challenging Long-context Instruction-based Exam for LLMs

Abstract

The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate LLMs in different scenarios, various benchmarks have emerged. However, most of these benchmarks focus on identifying key information to answer questions, which mainly requires retrieval ability, so they only partially represent how well LLMs reason over large amounts of information. Meanwhile, although LLMs are often advertised with context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the context length these LLMs actually support. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, built from existing instruction datasets. Specifically, LongIns introduces three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations of existing LLMs and report the following important findings: (1) The top-performing GPT-4, with a 128k context length, performs poorly at an evaluation context window of 16k in LongIns. (2) For many existing LLMs, significant effort is still needed to improve multi-hop reasoning ability, even under short context windows (less than 4k).
