LongGenBench: Long-context Generation Benchmark

Abstract

Current long-context benchmarks primarily focus on retrieval-based tests, such as the needle-in-a-haystack (NIAH) benchmark, which require Large Language Models (LLMs) to locate specific information within extensive input contexts. Long-context generation, by contrast, refers to a language model's ability to generate coherent and contextually accurate text that spans lengthy passages or documents. While recent studies report strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce LongGenBench, a synthetic benchmark that allows flexible configuration of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the question format and requiring LLMs to respond with a single, cohesive long-context answer. From an extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit different trends of performance degradation, with Gemini-1.5-Flash showing the least degradation among API-accessed models and the Qwen2 series showing the least degradation among open-source models.
