WritingBench: A Comprehensive Benchmark for Generative Writing
March 7, 2025
Authors: Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang
cs.AI
Abstract
Recent advancements in large language models (LLMs) have significantly
enhanced text generation capabilities, yet evaluating their performance in
generative writing remains a challenge. Existing benchmarks primarily focus on
generic text generation or limited writing tasks, failing to capture the
diverse requirements of high-quality written content across various domains.
To bridge this gap, we present WritingBench, a comprehensive benchmark designed
to evaluate LLMs across 6 core writing domains and 100 subdomains, encompassing
creative, persuasive, informative, and technical writing. We further propose a
query-dependent evaluation framework that empowers LLMs to dynamically generate
instance-specific assessment criteria. This framework is complemented by a
fine-tuned critic model for criteria-aware scoring, enabling evaluations in
style, format and length. The framework's validity is further demonstrated by
its data curation capability, which enables 7B-parameter models to approach
state-of-the-art (SOTA) performance. We open-source the benchmark, along with
evaluation tools and modular framework components, to advance the development
of LLMs in writing.
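To make the query-dependent evaluation idea concrete, the following is a minimal Python sketch of such a pipeline: a judge LLM first drafts instance-specific criteria for a given writing query, and the response is then scored against each criterion. The `generate` callable, prompt wording, number of criteria, and 1-10 scale are illustrative assumptions, not WritingBench's actual implementation (which relies on a fine-tuned critic model for criteria-aware scoring).

```python
# Illustrative sketch of query-dependent evaluation (not WritingBench's actual code).
# `generate` is any chat-completion call you supply that maps a prompt string to a
# model reply string; prompts, criterion count, and the 1-10 scale are assumptions.
import json
from typing import Callable, Dict, List


def generate_criteria(generate: Callable[[str], str], query: str, k: int = 5) -> List[Dict]:
    """Ask a judge LLM for instance-specific rubric items tailored to this writing query."""
    prompt = (
        "You are designing an evaluation rubric for the writing request below.\n"
        f"Request:\n{query}\n\n"
        f"Return a JSON list of {k} criteria, each with 'name' and 'description', "
        "covering the style, format, and length requirements implied by the request."
    )
    return json.loads(generate(prompt))


def score_response(generate: Callable[[str], str], query: str, response: str,
                   criteria: List[Dict]) -> Dict[str, int]:
    """Score the response against each generated criterion on a 1-10 scale."""
    scores: Dict[str, int] = {}
    for criterion in criteria:
        prompt = (
            f"Writing request:\n{query}\n\nResponse:\n{response}\n\n"
            f"Criterion: {criterion['name']} - {criterion['description']}\n"
            "Rate how well the response satisfies this criterion from 1 (poor) "
            "to 10 (excellent). Reply with the integer only."
        )
        scores[criterion["name"]] = int(generate(prompt).strip())
    return scores
```

In use, `generate_criteria` would be called once per benchmark query and `score_response` once per model output, so each instance is judged against its own rubric rather than a fixed, query-agnostic checklist.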