Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

June 15, 2024
Authors: Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak
cs.AI

Abstract

We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.
