LongGenBench: Long-context Generation Benchmark

Abstract

Current long-context benchmarks primarily focus on retrieval-based tests, such as the needle-in-a-haystack (NIAH) benchmark, which require Large Language Models (LLMs) to locate specific information within extensive input contexts. Long-context generation, by contrast, refers to the ability of a language model to generate coherent and contextually accurate text that spans lengthy passages or documents. While recent studies report strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce LongGenBench, a synthetic benchmark that allows flexible configuration of customized generation context lengths. LongGenBench goes beyond traditional benchmarks by redesigning the question format and requiring LLMs to respond with a single, cohesive long-context answer. From an extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with Gemini-1.5-Flash showing the least degradation among API-accessed models and the Qwen2 series showing the least degradation among open-source models.
