POEMetric: The Last Stanza of Humanity
April 4, 2026
Authors: Bingru Li, Han Wang, Hazel Wilkinson
cs.AI
Abstract
Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation. It examines 1) basic instruction-following abilities in generating poems according to a given form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of overall poem quality and estimation of authorship. We curated a human poem dataset (203 English poems in 7 fixed forms, annotated with meter, rhyme patterns, and themes) and experimented with 30 LLMs for poetry generation based on the same forms and themes as the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, although the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as the judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also outperformed the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and code are released at https://github.com/Bingru-Li/POEMetric.
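
The abstract lists lexical diversity among the advanced abilities but does not say how it is measured. As a rough illustration only, the sketch below uses the type-token ratio (distinct words divided by total words), a common stand-in that is assumed here and is not confirmed as POEMetric's metric. The function name `type_token_ratio`, the tokenizing pattern, and the sample line are all hypothetical.

```python
import re


def type_token_ratio(poem: str) -> float:
    """Distinct words divided by total words; 0.0 for an empty poem.

    Assumption: a simple lowercase word tokenizer is good enough for
    illustration. The paper's actual measure may differ.
    """
    tokens = re.findall(r"[a-z']+", poem.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical input written for this example, not taken from the dataset.
sample = "The rose is red, the rose is bright"
print(type_token_ratio(sample))  # 5 distinct words / 8 words = 0.625
```

A raw ratio like this falls as texts get longer, so comparing poems of very different lengths would need a length-normalized variant.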