K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
August 26, 2024
Authors: Zhikai Li, Xuewen Liu, Dongrong Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, Zhen Dong
cs.AI
Abstract
The rapid advancement of visual generative models necessitates efficient and
reliable evaluation methods. Arena platforms, which gather user votes on model
comparisons, can rank models by human preference. However, traditional Arena
methods, while established, require an excessive number of comparisons for
ranking to converge and are vulnerable to preference noise in voting,
suggesting the need for better approaches tailored to contemporary evaluation
challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable
platform based on a key insight: images and videos possess higher perceptual
intuitiveness than texts, enabling rapid evaluation of multiple samples
simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing
K models to engage in free-for-all competitions, which yield much richer
information than pairwise comparisons. To enhance the robustness of the system,
we leverage probabilistic modeling and Bayesian updating techniques. We propose
an exploration-exploitation-based matchmaking strategy to facilitate more
informative comparisons. In our experiments, K-Sort Arena converges 16.3x faster
than the widely used ELO algorithm. To further validate its
superiority and obtain a comprehensive leaderboard, we collect human feedback
via crowdsourced evaluations of numerous cutting-edge text-to-image and
text-to-video models. Thanks to its high efficiency, K-Sort Arena can
continuously incorporate emerging models and update the leaderboard with
minimal votes. Our project has undergone several months of internal testing and
is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena.
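
To make the claim about richer information concrete: a single best-to-worst ranking of K models implies K(K-1)/2 pairwise outcomes, whereas a pairwise vote yields only one. The sketch below counts those implied pairs and, purely for illustration, feeds them into a standard Elo-style update. It is not the method of the paper, which relies on probabilistic modeling and Bayesian updating; the function names, the model names, and the `k_factor` value are all hypothetical.

```python
from itertools import combinations


def implied_pairs(ranking):
    """Return the (winner, loser) pairs implied by a best-to-worst ranking."""
    # combinations() preserves input order, so the first element of each
    # pair is always the model that was ranked higher.
    return list(combinations(ranking, 2))


def elo_expected(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_from_ranking(ratings, ranking, k_factor=32.0):
    """Apply one Elo-style update per implied pair (illustrative only).

    All expectations are computed from the ratings held before the vote,
    so the result does not depend on the order the pairs are processed in.
    """
    deltas = {model: 0.0 for model in ranking}
    for winner, loser in implied_pairs(ranking):
        gain = k_factor * (1.0 - elo_expected(ratings[winner], ratings[loser]))
        deltas[winner] += gain
        deltas[loser] -= gain
    updated = dict(ratings)
    for model, delta in deltas.items():
        updated[model] = ratings[model] + delta
    return updated


if __name__ == "__main__":
    # Hypothetical models, all starting from the same rating.
    ratings = {"model_a": 1000.0, "model_b": 1000.0,
               "model_c": 1000.0, "model_d": 1000.0}
    vote = ["model_a", "model_b", "model_c", "model_d"]  # best to worst
    print(len(implied_pairs(vote)), "pairwise outcomes from one 4-way vote")
    print(update_from_ranking(ratings, vote))
```

With K = 4, one vote moves all four ratings at once, which gives an intuition for why fewer votes may be needed for a ranking to settle.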
AI-Generated Summary