
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

October 14, 2024
Authors: Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes
cs.AI

Abstract

The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential contamination of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models, we propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer (VQA) pairs. This is done without any human in the loop, using the multi-modal content in the manuscripts, such as graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities while avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.
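
The abstract states only that an efficient evaluation approach estimates all models' scores on each new benchmark version from runs of a subset of models; the procedure itself is not given here. The sketch below is a minimal illustration of how such subset-based estimation can work in principle, using a simple Rasch-style (one-parameter IRT) model on synthetic data. The model and question counts, the moment-matching estimators, and all variable names are assumptions made for this example, not the paper's implementation.

```python
# Minimal, illustrative sketch -- NOT the paper's exact procedure. It shows one way
# "run a subset of models, estimate the rest" can work, using a Rasch-style
# (one-parameter IRT) model. All quantities below are synthetic.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate(abilities, difficulties):
    """Bernoulli correctness matrix under the Rasch model."""
    return rng.random((len(abilities), len(difficulties))) < sigmoid(
        abilities[:, None] - difficulties[None, :]
    )

n_models, n_old, n_new = 12, 800, 300
ability = rng.normal(0.0, 1.0, n_models)   # latent per-model skill (unknown in practice)
diff_old = rng.normal(0.0, 1.0, n_old)     # difficulty of previous-version questions
diff_new = rng.normal(0.3, 1.0, n_new)     # difficulty of newly generated questions

# Previous benchmark version: every model was evaluated, so abilities can be
# estimated (crudely here, via the logit of each model's overall accuracy).
old_correct = simulate(ability, diff_old)
acc_old = old_correct.mean(axis=1).clip(0.05, 0.95)
est_ability = np.log(acc_old / (1.0 - acc_old))

# New version: only a small core subset of models is actually run on the new questions.
core = rng.choice(n_models, size=4, replace=False)
core_correct_new = simulate(ability[core], diff_new)

# Estimate new-question difficulties from the core models' results and their known abilities.
q_acc = core_correct_new.mean(axis=0).clip(0.05, 0.95)
est_diff_new = est_ability[core].mean() - np.log(q_acc / (1.0 - q_acc))

# Predict each model's accuracy on the new version; models outside the core were never run on it.
pred_acc_new = sigmoid(est_ability[:, None] - est_diff_new[None, :]).mean(axis=1)
true_acc_new = sigmoid(ability[:, None] - diff_new[None, :]).mean(axis=1)
for m in range(n_models):
    tag = "core" if m in core else "estimated"
    print(f"model {m:2d} ({tag:9s}): predicted {pred_acc_new[m]:.3f}  true {true_acc_new[m]:.3f}")
```

The design intent this toy version captures is that once new-question difficulties are pinned down by a few core models, predicted scores for the remaining models come at negligible cost, which is what keeps repeated evaluation of an evolving benchmark affordable.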
