

FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

May 30, 2025
作者: Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
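The abstract does not specify how FinScore combines its components. As an illustration only, a hallucination-penalized, multi-dimensional score could be sketched as below; the function name, the averaging scheme, and the penalty weight are all assumptions for exposition, not the paper's actual FinScore definition:

```python
def finscore(dim_accuracy: dict, hallucination_rate: float, penalty: float = 0.5) -> float:
    """Hypothetical hallucination-penalized score (illustrative sketch only;
    not the FinScore formula defined in the FinMME paper).

    dim_accuracy       -- per-capability-dimension accuracy in [0, 1]
    hallucination_rate -- fraction of responses flagged as hallucinated
    penalty            -- assumed weight on the hallucination term
    """
    # Average accuracy across capability dimensions (multi-dimensional assessment).
    base = sum(dim_accuracy.values()) / len(dim_accuracy)
    # Subtract a weighted hallucination penalty, clamped at zero.
    return max(0.0, base - penalty * hallucination_rate)


# Example with made-up dimension names and numbers:
score = finscore(
    {"perception": 0.82, "reasoning": 0.61, "calculation": 0.55},
    hallucination_rate=0.10,
)
```

The clamp at zero keeps the score interpretable when a model hallucinates heavily; any real scoring system would pin down the penalty weight and dimension set empirically.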
