Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
June 15, 2024
Authors: Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak
cs.AI
Abstract
We present a novel multimodal preference dataset for creative tasks,
consisting of over 250 million human ratings on more than 2.2 million captions,
collected through crowdsourced ratings for The New Yorker's weekly cartoon
caption contest over the past eight years. This unique dataset supports the
development and evaluation of multimodal large language models and
preference-based fine-tuning algorithms for humorous caption generation. We
propose novel benchmarks for judging the quality of model-generated captions,
utilizing both GPT4 and human judgments to establish ranking-based evaluation
strategies. Our experimental results highlight the limitations of current
fine-tuning methods, such as RLHF and DPO, when applied to creative tasks.
Furthermore, we demonstrate that even state-of-the-art models like GPT4 and
Claude currently underperform top human contestants in generating humorous
captions. As we conclude this extensive data collection effort, we release the
entire preference dataset to the research community, fostering further
advancements in AI humor generation and evaluation.
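As a rough illustration of how such crowd-sourced ratings could feed the preference-based fine-tuning the abstract describes, here is a minimal sketch that converts per-caption mean ratings into chosen/rejected pairs for a DPO-style trainer. The schema (`contest_id`, `caption`, `mean_rating`) and the `build_preference_pairs` helper are assumptions for illustration, not the released dataset's actual format.

```python
# Minimal sketch: deriving DPO-style preference pairs from crowd ratings.
# The column names and rating scale are hypothetical, not the dataset's
# actual schema.
import random

import pandas as pd


def build_preference_pairs(df: pd.DataFrame, pairs_per_contest: int = 100) -> pd.DataFrame:
    """Pair a higher-rated caption (chosen) against a lower-rated one
    (rejected) within each contest, the format DPO-style trainers expect."""
    pairs = []
    for contest_id, group in df.groupby("contest_id"):
        # Sort captions best-first by their crowd-sourced mean rating.
        ranked = group.sort_values("mean_rating", ascending=False)["caption"].tolist()
        if len(ranked) < 2:
            continue
        for _ in range(pairs_per_contest):
            # Two distinct indices; i < j, so ranked[i] is the better caption.
            i, j = sorted(random.sample(range(len(ranked)), 2))
            pairs.append({
                "prompt": f"Write a funny caption for cartoon {contest_id}.",
                "chosen": ranked[i],    # higher mean rating
                "rejected": ranked[j],  # lower mean rating
            })
    return pd.DataFrame(pairs)
```

Pairs in this prompt/chosen/rejected format can be handed to off-the-shelf preference-tuning implementations such as TRL's DPOTrainer; the abstract's central finding is that such pipelines still fall short of top human contestants on creative tasks like humorous captioning.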