
POEMetric: The Last Stanza of Humanity

April 4, 2026
Authors: Bingru Li, Han Wang, Hazel Wilkinson
cs.AI

Abstract

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation. It examines 1) basic instruction-following abilities in generating poems according to a given form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of overall poem quality and estimation of authorship. We curated a human poem dataset (203 English poems in 7 fixed forms, annotated with meter, rhyme patterns, and themes) and had 30 LLMs generate poems with the same forms and themes as the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, with the results validated by human experts. Results show that, although the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as the judge; same below) and theme alignment (4.99), all models failed to reach the level of human poets in advanced abilities. Human poets achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also outperformed the best LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and code are released at https://github.com/Bingru-Li/POEMetric.