GenAI Arena: An Open Evaluation Platform for Generative Models

June 6, 2024
Authors: Dongfu Jiang, Max Ku, Tianle Li, Yuansheng Ni, Shizhuo Sun, Rongqi Fan, Wenhu Chen
cs.AI

Abstract

Generative AI has made remarkable strides in revolutionizing fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes GenAI-Arena, an open platform for evaluating different image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas: text-to-image generation, text-to-video generation, and image editing. We currently cover a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6,000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods used to rank the models. To further promote research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and we compute the correlation between model votes and human votes to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves only a Pearson correlation of 0.22 on the quality subscore and behaves like random guessing on the others.
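The abstract points to two concrete computations: ranking models from pairwise community votes, and scoring a multimodal judge by its Pearson correlation with human votes. As a minimal sketch, the Python below implements an Elo-style update, a common choice for arena-style leaderboards (the abstract does not name the exact statistical method), together with a Pearson check; the model names, votes, K-factor, and score lists are illustrative assumptions, not the paper's data or pipeline.

```python
# Minimal, hypothetical sketch: Elo-style ranking from pairwise arena votes,
# plus a Pearson correlation between judge scores and human preferences.
# All data below is made up for illustration.
from statistics import correlation  # Pearson's r; requires Python 3.10+

K = 32                            # Elo K-factor (assumed, not from the paper)
ratings: dict[str, float] = {}    # model name -> Elo rating

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(winner: str, loser: str) -> None:
    """Apply one pairwise user vote: `winner` was preferred over `loser`."""
    ra = ratings.setdefault(winner, 1000.0)
    rb = ratings.setdefault(loser, 1000.0)
    ea = expected(ra, rb)
    ratings[winner] = ra + K * (1.0 - ea)
    ratings[loser] = rb - K * (1.0 - ea)

# Hypothetical (winner, loser) pairs from side-by-side comparisons.
for w, l in [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]:
    record_vote(w, l)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))

# Judging ability: correlate a multimodal judge's quality scores with
# human preference rates on the same items (both lists hypothetical).
human_scores = [0.9, 0.2, 0.6, 0.8, 0.1]
judge_scores = [0.7, 0.3, 0.5, 0.6, 0.4]
print(f"Pearson r = {correlation(human_scores, judge_scores):.2f}")
```

Online Elo updates are order-dependent; leaderboards often prefer a Bradley-Terry fit over all votes for stability, but the pairwise-preference input and the correlation check against human votes are the same in either case.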
