MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
October 17, 2024
Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh
cs.AI
Abstract
Perceiving and generating diverse modalities are crucial for AI models to
effectively learn from and engage with real-world signals, necessitating
reliable evaluations for their development. We identify two major issues in
current evaluations: (1) inconsistent standards, shaped by different
communities with varying protocols and maturity levels; and (2) significant
query, grading, and generalization biases. To address these, we introduce
MixEval-X, the first any-to-any real-world benchmark designed to optimize and
standardize evaluations across input and output modalities. We propose
multi-modal benchmark mixture and adaptation-rectification pipelines to
reconstruct real-world task distributions, ensuring evaluations generalize
effectively to real-world use cases. Extensive meta-evaluations show our
approach effectively aligns benchmark samples with real-world task
distributions, and the model rankings correlate strongly with those of
crowd-sourced real-world evaluations (up to 0.98). We provide comprehensive
leaderboards to rerank existing models and organizations and offer insights to
enhance understanding of multi-modal evaluations and inform future research.
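The meta-evaluation claim above compares a benchmark's model ranking against a crowd-sourced ranking using rank correlation. A minimal sketch of how such a comparison could be computed (the model names, scores, and Elo ratings below are hypothetical, and the paper's exact correlation metric may differ):

```python
# Hypothetical sketch: Spearman rank correlation between a benchmark's
# model ranking and a crowd-sourced (arena-style) ranking.

def spearman(scores_a, scores_b):
    """Spearman rank correlation between two score dicts that share the
    same keys. No tie handling; assumes all scores are distinct."""
    models = sorted(scores_a)

    def ranks(scores):
        # Rank models by descending score: best model gets rank 0.
        ordered = sorted(models, key=lambda m: scores[m], reverse=True)
        return {m: i for i, m in enumerate(ordered)}

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(models)
    d_squared = sum((ra[m] - rb[m]) ** 2 for m in models)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up benchmark accuracies and crowd-sourced Elo ratings.
benchmark = {"model-1": 71.2, "model-2": 68.4, "model-3": 64.9, "model-4": 60.1}
crowd     = {"model-1": 1250, "model-2": 1235, "model-3": 1190, "model-4": 1202}

print(round(spearman(benchmark, crowd), 2))  # → 0.8
```

A correlation near 1.0 (the paper reports up to 0.98) indicates that the benchmark orders models almost identically to real-world crowd preferences.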