GenAI Arena: An Open Evaluation Platform for Generative Models

Abstract

Generative AI has made remarkable strides in fields such as image and video generation, driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has exposed a critical gap: the absence of trustworthy evaluation metrics. Current automatic metrics such as FID, CLIP, and FVD often fail to capture the nuanced quality of generated outputs and the satisfaction of the users who view them. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models in which users actively participate in the evaluation. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It comprises three arenas, for text-to-image generation, text-to-video generation, and image editing, and currently covers a total of 27 open-source generative models. GenAI-Arena has been operating for four months and has amassed over 6000 votes from the community. We describe the platform, analyze the data, and explain the statistical methods used to rank the models. To further promote research on model-based evaluation metrics, we release GenAI-Bench, a cleaned version of our preference data for the three tasks. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and compute the correlation between model votes and human votes to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves a Pearson correlation of only 0.22 on the quality subscore and behaves like random guessing on the others.
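
To illustrate the correlation measurement mentioned in the abstract, the following is a minimal sketch of computing a Pearson correlation between human votes and model votes. The numeric vote encoding, the sample data, and the function name `pearson` are assumptions made for this example; they are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical encoding (an assumption, not the paper's scheme):
# +1 = left output preferred, -1 = right output preferred, 0 = tie.
human_votes = [1, -1, 0, 1, -1, 1]
model_votes = [1, 1, 0, -1, -1, 1]

r = pearson(human_votes, model_votes)
print(f"Pearson correlation between human and model votes: {r:.2f}")
```

A value near 1 would mean the model's votes track human votes closely, while a value near 0 would mean the model's votes carry little information about human preference.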
