Spinning the Golden Thread: Benchmarking Long-Form Generation in Language Models

Abstract

The abilities of long-context language models (LMs) are often evaluated using the "Needle-in-a-Haystack" (NIAH) test, which comprises tasks designed to assess a model's ability to identify specific information (the "needle") within large text sequences (the "haystack"). While these benchmarks measure how well models understand long-context input sequences, they do not effectively gauge the quality of long-form text generation, a critical aspect for applications such as design proposals and creative writing. To address this gap, we introduce a new long-form text evaluation benchmark, Spinning the Golden Thread (SGT), which tests models' ability to identify specific events within generated long text sequences. In this benchmark, we prompt long-context LMs to create long-form text that must include particular events or constraints, and we evaluate their ability to incorporate these elements. We evaluate ten long-context LMs across four distinct scenarios, three types of prompt instructions, and two generation-length settings (16K and 32K). Although these models perform well on NIAH benchmarks, none demonstrates satisfactory performance on Spinning the Golden Thread, raising concerns about their ability to generate coherent long-form text that follows instructions. Additionally, as the length of the generated text increases, all models exhibit a significant drop in performance.
