K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

Abstract

The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platforms, which gather user votes on model comparisons, can rank models according to human preferences. However, traditional Arena methods, while well established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos are more perceptually intuitive than text, which makes it possible to evaluate multiple samples quickly and simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, in which K models engage in free-for-all competitions that yield much richer information than pairwise comparisons. To enhance the robustness of the system, we leverage probabilistic modeling and Bayesian updating techniques. We further propose an exploration-exploitation-based matchmaking strategy to facilitate more informative comparisons. In our experiments, K-Sort Arena converges 16.3x faster than the widely used Elo algorithm. To further validate its superiority and obtain a comprehensive leaderboard, we collect human feedback via crowdsourced evaluations of numerous cutting-edge text-to-image and text-to-video models. Thanks to its high efficiency, K-Sort Arena can continuously incorporate emerging models and update the leaderboard with minimal votes. Our project has undergone several months of internal testing and is now available at https://huggingface.co/spaces/ksort/K-Sort-Arena
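
To make two ideas from the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The first function shows why one K-wise vote carries more information than one pairwise vote: a full ranking of K models implies K(K-1)/2 pairwise outcomes. The second shows one common way to balance exploration and exploitation when choosing which models to compare, a UCB-style score that adds an uncertainty bonus to the estimated skill. The names Rating, implied_pairs, and pick_players, the mu/sigma skill representation, and the constant c are all hypothetical; the abstract does not specify any of them.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Rating:
    """Hypothetical skill estimate for one model (an assumption for illustration)."""
    mu: float = 0.0     # estimated skill
    sigma: float = 1.0  # uncertainty about that estimate


def implied_pairs(ranking):
    """Return the (winner, loser) pairs implied by a best-to-worst ranking.

    combinations() preserves input order, so the first element of each
    pair is always the model that was ranked higher.
    """
    return list(combinations(ranking, 2))


def pick_players(ratings, k, c=1.0):
    """Pick k models with the highest UCB-style score mu + c * sigma.

    A high mu favors strong models (exploitation); a high sigma favors
    models whose skill is still uncertain (exploration).
    """
    def score(name):
        return ratings[name].mu + c * ratings[name].sigma

    return sorted(ratings, key=score, reverse=True)[:k]


if __name__ == "__main__":
    # One 4-wise vote implies 4 * 3 / 2 = 6 pairwise outcomes.
    print(len(implied_pairs(["a", "b", "c", "d"])))

    ratings = {
        "a": Rating(mu=1.0, sigma=0.1),
        "b": Rating(mu=0.0, sigma=2.0),   # uncertain, e.g. a newly added model
        "c": Rating(mu=0.5, sigma=0.2),
        "d": Rating(mu=-1.0, sigma=0.1),
    }
    print(pick_players(ratings, k=2))
```

In this toy setup, the newly added model "b" is selected despite its low estimated skill because its uncertainty is large, which is the kind of behavior that would let a leaderboard absorb emerging models with few votes. How K-Sort Arena actually models skill and updates it after each vote is not described in the abstract, so that step is left out of the sketch.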
