WritingBench: A Comprehensive Benchmark for Generative Writing

Abstract

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text generation or on a limited set of writing tasks, failing to capture the diverse requirements of high-quality written content across various domains. To bridge this gap, we present WritingBench, a comprehensive benchmark designed to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We further propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluation of style, format, and length. The framework's validity is further demonstrated by its data curation capability, which enables 7B-parameter models to approach state-of-the-art (SOTA) performance. We open-source the benchmark, along with evaluation tools and modular framework components, to advance the development of LLMs in writing.
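
To make the query-dependent evaluation idea concrete, the following is a minimal sketch and not the released WritingBench code. It assumes that criteria generation and criteria-aware scoring can each be modeled as a callable; the names `Criterion`, `evaluate_response`, `toy_generate_criteria`, and `toy_critic` are hypothetical, and the numeric scale and the use of a plain mean to aggregate per-criterion scores are illustrative assumptions rather than details stated in the abstract.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Criterion:
    """One instance-specific assessment criterion (hypothetical structure)."""
    name: str
    description: str


# Assumed interfaces: in practice an LLM would generate the criteria and a
# fine-tuned critic model would produce the scores.
CriteriaGenerator = Callable[[str], List[Criterion]]
Critic = Callable[[str, str, Criterion], float]


def evaluate_response(
    query: str,
    response: str,
    generate_criteria: CriteriaGenerator,
    critic: Critic,
) -> float:
    """Score a response against criteria generated for this specific query."""
    criteria = generate_criteria(query)
    if not criteria:
        raise ValueError("criteria generator returned no criteria")
    scores = [critic(query, response, c) for c in criteria]
    # Assumption: a plain mean over criteria; the actual aggregation may differ.
    return mean(scores)


def toy_generate_criteria(query: str) -> List[Criterion]:
    """Rule-based stand-in for an LLM that writes criteria per query."""
    criteria = [Criterion("relevance", "The response addresses the request.")]
    if "words" in query.lower():
        criteria.append(Criterion("length", "The response respects the word limit."))
    return criteria


def toy_critic(query: str, response: str, criterion: Criterion) -> float:
    """Stand-in for a critic model; returns a score on an assumed 1-10 scale."""
    if criterion.name == "length":
        return 10.0 if len(response.split()) <= 50 else 1.0
    return 5.0


if __name__ == "__main__":
    score = evaluate_response(
        "Write a poem in under 50 words.",
        "Leaves fall, quiet and slow.",
        toy_generate_criteria,
        toy_critic,
    )
    print(score)
```

The point of the sketch is that the set of criteria depends on the query: a request with an explicit length constraint is scored on length, while a request without one is not.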