LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
October 14, 2024
Authors: Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes
cs.AI
Abstract
The large-scale training of multi-modal models on data scraped from the web
has shown outstanding utility in infusing these models with the required world
knowledge to perform effectively on multiple downstream tasks. However, one
downside of scraping data from the web can be the potential sacrifice of the
benchmarks on which the abilities of these models are often evaluated. To
safeguard against test data contamination and to truly test the abilities of
these foundation models, we propose LiveXiv: a scalable evolving live benchmark
based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts
at any given timestamp and proposes to automatically generate visual
question-answer pairs (VQA). This is done without any human-in-the-loop, using
the multi-modal content in the manuscripts, like graphs, charts, and tables.
Moreover, we introduce an efficient evaluation approach that estimates the
performance of all models on the evolving benchmark using evaluations of only a
subset of models. This significantly reduces the overall evaluation cost. We
benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the
first version of our benchmark, showing its challenging nature and exposing the
models' true abilities, avoiding contamination. Lastly, in our commitment to
high quality, we have collected and evaluated a manually verified subset. By
comparing its overall results to our automatic annotations, we have found that
the performance variance is indeed minimal (<2.5%). Our dataset is available
online on HuggingFace, and our code will be available here.
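
The automatic VQA-generation step described in the abstract can be pictured with a minimal Python sketch. Nothing below comes from the paper itself: query_lmm is a hypothetical placeholder for whatever multi-modal model the authors actually prompt, and the prompt text and file names are invented for illustration.

from dataclasses import dataclass

@dataclass
class VQAPair:
    figure_id: str
    question: str
    answer: str

def query_lmm(image_path: str, prompt: str) -> str:
    # Hypothetical stand-in for a real multi-modal LLM call; returns canned text here.
    return "Q: What quantity is plotted on the y-axis?\nA: Top-1 accuracy."

def generate_vqa(figure_paths: list[str]) -> list[VQAPair]:
    # Loop over figures/tables extracted from a manuscript and ask the model for a
    # question-answer pair about each one, with no human in the loop.
    pairs = []
    for path in figure_paths:
        raw = query_lmm(path, "Write one question about this figure and its answer.")
        q_line, a_line = raw.split("\n", 1)
        pairs.append(VQAPair(figure_id=path,
                             question=q_line.removeprefix("Q: ").strip(),
                             answer=a_line.removeprefix("A: ").strip()))
    return pairs

print(generate_vqa(["paper_fig2.png"]))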
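The abstract also mentions estimating every model's score on a new benchmark version from evaluations of only a subset of models. The paper's actual estimator is not described in the abstract; the sketch below merely assumes a roughly linear relationship between a model's scores on consecutive benchmark versions, which is one plausible way such an estimate could be obtained, and all numbers are made up.

import numpy as np

def estimate_scores(old_scores, new_scores_subset, subset_idx):
    # old_scores        : (n_models,) accuracies of every model on the previous version
    # new_scores_subset : (k,) accuracies of the k re-evaluated models on the new version
    # subset_idx        : indices (length k) of the re-evaluated models
    x = old_scores[subset_idx]
    # Fit a simple linear map old -> new on the evaluated subset.
    a, b = np.polyfit(x, new_scores_subset, deg=1)
    estimated = a * old_scores + b
    # Keep the exact scores where evaluations were actually run.
    estimated[subset_idx] = new_scores_subset
    return estimated

# Hypothetical usage with made-up numbers:
old = np.array([0.62, 0.55, 0.71, 0.48, 0.66])
subset = np.array([0, 2, 3])                # only these models are re-run
new_subset = np.array([0.58, 0.69, 0.44])   # their scores on the new version
print(estimate_scores(old, new_subset, subset))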