K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

Abstract

The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platforms, which gather user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while well established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, which suggests the need for approaches better suited to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos are more perceptually intuitive than text, so several samples can be evaluated rapidly at the same time. Consequently, K-Sort Arena employs K-wise comparisons, in which K models engage in free-for-all competitions that yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We also propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena converges 16.3x faster than the widely used ELO algorithm. To further validate its superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
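
To illustrate why a single K-wise vote carries more information than a pairwise one, the minimal Python sketch below expands one ranking of K models into its K(K-1)/2 implied pairwise outcomes and feeds them to a textbook ELO update. It is not the K-Sort Arena method, which the abstract describes as relying on probabilistic modeling and Bayesian updating. The function names, the model labels, the starting rating of 1000, and the K-factor of 32 are all assumptions chosen for illustration.

```python
from itertools import combinations


def pairwise_outcomes(ranking):
    """Expand one K-wise ranking (best model first) into (winner, loser) pairs.

    combinations() preserves input order, so the first element of each
    pair is always ranked above the second.
    """
    return list(combinations(ranking, 2))


def elo_update(ratings, winner, loser, k_factor=32.0):
    """Apply one standard ELO update in place (illustrative baseline only)."""
    # Probability that `winner` beats `loser` under the ELO logistic model.
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    delta = k_factor * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical example: one user ranks K = 4 model outputs, best first.
ranking = ["model_a", "model_b", "model_c", "model_d"]
ratings = {name: 1000.0 for name in ranking}

pairs = pairwise_outcomes(ranking)
for winner, loser in pairs:
    elo_update(ratings, winner, loser)

print(f"{len(pairs)} pairwise outcomes from one {len(ranking)}-wise vote")
for name in ranking:
    print(name, round(ratings[name], 1))
```

With K = 4, one vote implies six pairwise outcomes, whereas a pairwise arena would need six separate votes to collect the same comparisons. How K-Sort Arena actually converts a ranking into rating updates is not specified in the abstract.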
