POEMetric: The Last Stanza of Humanity

Abstract

Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation. It examines 1) basic instruction-following abilities in generating poems that follow a given form and theme, 2) advanced abilities: showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of overall poem quality and estimation of authorship. We curated a dataset of human poems (203 English poems in 7 fixed forms, annotated with meter, rhyme patterns, and themes) and had 30 LLMs generate poems on the same forms and themes as the human data, for a total of 6,090 LLM-generated poems. Using POEMetric, we assessed both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, with the results validated by human experts. Although the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as the judge; all scores below use the same judge and scale) and theme alignment (4.99), no model matched human poets on the advanced abilities: humans achieved unmatched scores in creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also outperformed the best-performing LLM in overall poem quality (4.22 vs. 3.20). Poetry generation therefore remains a formidable challenge for LLMs. Data and code are released at https://github.com/Bingru-Li/POEMetric.
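
To make the rule-based side of the evaluation concrete, the sketch below computes a type-token ratio, one common proxy for lexical diversity. This is an illustrative assumption only: the abstract does not say which lexical diversity measure POEMetric uses, and the function name `type_token_ratio` and the tokenization rule are hypothetical, not taken from the released code.

```python
import re


def type_token_ratio(poem: str) -> float:
    """Return distinct word forms divided by total word tokens.

    Hypothetical helper for illustration; not the metric defined by POEMetric.
    Returns 0.0 for text with no word tokens.
    """
    # Lowercase the text and treat alphabetic runs (allowing internal
    # apostrophes, as in "don't") as word tokens.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", poem.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Four tokens, three distinct forms ("the" repeats).
    print(type_token_ratio("the cat the dog"))
```

A ratio closer to 1.0 means fewer repeated words. Raw type-token ratios tend to fall as texts get longer, so comparing poems of different lengths would likely call for a length-normalized variant.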
