

FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

May 30, 2025
作者: Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
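The abstract does not specify FinScore's exact formula, so the following is only an illustrative sketch of the general idea of an accuracy metric with a hallucination penalty; the function name, labels, and penalty weight are all assumptions, not the authors' actual implementation.

```python
def finscore_sketch(outcomes, hallucination_penalty=0.5):
    """Hypothetical sketch of a hallucination-penalized score.

    outcomes: list of per-sample labels, one of
              "correct", "wrong", or "hallucinated".
    Returns a score in [0, 1]: accuracy, with each hallucinated
    answer subtracting a penalty (clipped at zero).
    """
    n = len(outcomes)
    correct = sum(1 for o in outcomes if o == "correct")
    hallucinated = sum(1 for o in outcomes if o == "hallucinated")
    return max(0.0, (correct - hallucination_penalty * hallucinated) / n)


# Example: 2 correct, 1 hallucinated, 1 wrong out of 4 samples
score = finscore_sketch(["correct", "correct", "hallucinated", "wrong"])
```

The released evaluation protocol at https://github.com/luo-junyu/FinMME defines the actual metric, including the multi-dimensional capability assessment that this sketch omits.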