LongGenBench: Long-context Generation Benchmark

Abstract

Current long-context benchmarks primarily focus on retrieval-based tests, such as the needle-in-a-haystack (NIAH) benchmark, which require Large Language Models (LLMs) to locate specific information within extensive input contexts. Long-context generation, by contrast, refers to the ability of a language model to generate coherent and contextually accurate text that spans lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce LongGenBench, a synthetic benchmark that allows flexible configuration of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the question format so that LLMs must respond with a single, cohesive long-context answer. From extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models and the Qwen2 series showing the least degradation among open-source models.