GenAI Arena: An Open Evaluation Platform for Generative Models

June 6, 2024
Authors: Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
cs.AI

Abstract

Generative AI has made remarkable strides in revolutionizing fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes GenAI-Arena, an open platform for evaluating different image and video generative models, where users actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas: text-to-image generation, text-to-video generation, and image editing. Currently, we cover a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6,000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and compute the correlation between model votes and human votes to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves a Pearson correlation of only 0.22 on the quality subscore and behaves like random guessing on the others.
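The abstract does not spell out the ranking statistics, but arena-style leaderboards commonly derive ratings from pairwise votes with Elo-style updates, and the reported 0.22 figure is a standard Pearson correlation between model judgments and human votes. The sketch below illustrates both ideas under those assumptions; the `elo_update` helper, the vote log, and the model names are hypothetical and not taken from the paper's code.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

def elo_update(r_a, r_b, outcome, k=32.0):
    """Apply one Elo update from a single pairwise vote.

    outcome: 1.0 if model A wins, 0.0 if model B wins, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Hypothetical vote log: (model_a, model_b, outcome from A's perspective).
votes = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

# Fold the votes into per-model ratings, starting everyone at 1000.
ratings = {}
for a, b, outcome in votes:
    r_a, r_b = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    ratings[a], ratings[b] = elo_update(r_a, r_b, outcome)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))

# Judging ability: Pearson correlation between human votes and a
# multimodal model's votes on the same comparisons (toy numbers).
human_votes = [1.0, 0.0, 0.5, 1.0]
model_votes = [0.5, 0.0, 1.0, 1.0]
print(correlation(human_votes, model_votes))
```

A value near 0 from the correlation step is what "behaves like random guessing" refers to: the model's votes carry little linear relationship to the human preferences.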