LongGenBench: Long-context Generation Benchmark

Abstract

Current long-context benchmarks primarily focus on retrieval-based tests, which require Large Language Models (LLMs) to locate specific information within extensive input contexts; the needle-in-a-haystack (NIAH) benchmark is one example. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and requiring LLMs to respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models, and the Qwen2 series showing the least degradation among open-source models.
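As a rough illustration of the redesigned question format, the sketch below combines several questions into one prompt that asks for a single sequential answer, and computes a performance drop against a per-question baseline. The abstract does not specify the prompt wording or how degradation is computed, so the function names, the prompt template, and the absolute percentage-point difference used here are all assumptions made for illustration, not the benchmark's actual implementation.

```python
from typing import List


def build_long_generation_prompt(questions: List[str]) -> str:
    """Combine several questions into one prompt (hypothetical template).

    The model is asked to answer every question in order within one
    continuous response, so the total output grows with len(questions).
    """
    numbered = "\n".join(
        f"Question {i}: {q}" for i, q in enumerate(questions, start=1)
    )
    return (
        "Answer each of the following questions in order, in one "
        "continuous response. Begin each answer with 'Answer i:' "
        "where i is the question number.\n\n" + numbered
    )


def performance_drop(baseline_accuracy: float,
                     long_generation_accuracy: float) -> float:
    """Absolute drop in percentage points (assumed definition).

    baseline_accuracy: accuracy when questions are asked one at a time.
    long_generation_accuracy: accuracy when all answers must be produced
    in a single long response.
    """
    return baseline_accuracy - long_generation_accuracy
```

Increasing the number of questions passed to `build_long_generation_prompt` lengthens the required response, which is one way the generation length could be made configurable under these assumptions.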
