POEMetric: The Last Stanza of Humanity

Abstract

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation. It examines 1) basic instruction-following abilities in generating poems according to a given form and theme, 2) advanced abilities: showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of overall poem quality and estimation of authorship. We curated a human poem dataset of 203 English poems in 7 fixed forms, annotated with meter, rhyme patterns, and themes, and had 30 LLMs generate poems on the same forms and themes as the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, with the results validated by human experts. Results show that, although the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as the judge; same below) and theme alignment (4.99), all models failed to reach the level of human poets in advanced abilities: humans achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also outperformed the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs. Data and code are released at https://github.com/Bingru-Li/POEMetric.
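
To make the rule-based side of the evaluation concrete, the following is a minimal sketch of one possible lexical diversity measure. The abstract does not say how POEMetric computes lexical diversity, so the type-token ratio used here, the function name `type_token_ratio`, and the simple regex tokenizer are all assumptions for illustration, not the authors' method. The sketch also checks the corpus size, which is consistent with one generated poem per model per human poem (203 x 30 = 6,090).

```python
import re


def type_token_ratio(poem: str) -> float:
    """Distinct words divided by total words; 0.0 for an empty poem.

    Assumed stand-in metric: the source does not specify its formula.
    """
    # Lowercase and keep runs of letters/apostrophes as word tokens.
    tokens = re.findall(r"[a-z']+", poem.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Corpus size from the abstract: 203 human poems, 30 LLMs.
HUMAN_POEMS = 203
MODELS = 30
total_llm_poems = HUMAN_POEMS * MODELS

# Hypothetical sample line, not taken from the dataset.
sample = "the rose is red the violet blue"
ttr = type_token_ratio(sample)  # 6 distinct words out of 7 tokens
```

A raw type-token ratio falls as texts get longer, so comparing poems of different lengths would likely need a length-normalized variant; that choice is left open here.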
