MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
October 17, 2024
Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
cs.AI
Abstract
Perceiving and generating diverse modalities are crucial for AI models to
effectively learn from and engage with real-world signals, necessitating
reliable evaluations for their development. We identify two major issues in
current evaluations: (1) inconsistent standards, shaped by different
communities with varying protocols and maturity levels; and (2) significant
query, grading, and generalization biases. To address these, we introduce
MixEval-X, the first any-to-any real-world benchmark designed to optimize and
standardize evaluations across input and output modalities. We propose
multi-modal benchmark mixture and adaptation-rectification pipelines to
reconstruct real-world task distributions, ensuring evaluations generalize
effectively to real-world use cases. Extensive meta-evaluations show our
approach effectively aligns benchmark samples with real-world task
distributions, and the model rankings correlate strongly with those of
crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive
leaderboards to rerank existing models and organizations and offer insights to
enhance understanding of multi-modal evaluations and inform future research.
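The meta-evaluation claim above compares a benchmark's model ranking against a crowd-sourced ranking using rank correlation. A minimal sketch of how such a comparison could be computed (the model names, scores, and Elo ratings below are hypothetical, and the paper's exact correlation metric may differ):

```python
# Hypothetical sketch: Spearman rank correlation between a benchmark's
# model ranking and a crowd-sourced (arena-style) ranking.

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score dicts that share the
    same keys. No tie handling; assumes all scores are distinct."""
    models = sorted(scores_a)

    def ranks(scores):
        # Rank models by descending score: best model gets rank 0.
        ordered = sorted(models, key=lambda m: scores[m], reverse=True)
        return {m: i for i, m in enumerate(ordered)}

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(models)
    d_squared = sum((ra[m] - rb[m]) ** 2 for m in models)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up benchmark accuracies and crowd-sourced Elo ratings.
benchmark = {"model-1": 71.2, "model-2": 68.4, "model-3": 64.9, "model-4": 60.1}
crowd     = {"model-1": 1250, "model-2": 1235, "model-3": 1190, "model-4": 1202}

print(round(spearman(benchmark, crowd), 2))  # → 0.8
```

A correlation near 1.0 (the paper reports up to 0.98) indicates that the benchmark orders models almost identically to real-world crowd preferences.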