FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

Abstract

Multimodal Large Language Models (MLLMs) have developed rapidly in recent years. In the financial domain, however, effective and specialized multimodal evaluation datasets remain notably scarce. To advance the development of MLLMs in finance, we introduce FinMME, which comprises more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes and covers 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. We also develop FinScore, an evaluation system that incorporates hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experiments show that even state-of-the-art models such as GPT-4o perform unsatisfactorily on FinMME, highlighting how challenging the benchmark is. The benchmark is also highly robust: prediction variations under different prompts remain below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
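The robustness claim can be made concrete with a small calculation. The sketch below assumes that "prediction variation" means the gap between the highest and lowest accuracy a model obtains across different prompt wordings. The abstract does not define the measure, so this definition, the function name `prompt_variation`, and the sample numbers are illustrative assumptions, not the FinMME evaluation protocol or reported results.

```python
def prompt_variation(accuracy_by_prompt: dict[str, float]) -> float:
    """Return the spread (max minus min) of accuracy across prompt variants.

    Accuracies are expected in percent, so the result is in percentage points.
    This is an assumed definition of "prediction variation", for illustration.
    """
    scores = list(accuracy_by_prompt.values())
    if not scores:
        raise ValueError("need at least one prompt variant")
    return max(scores) - min(scores)


# Hypothetical accuracies (percent) for one model under three prompt wordings.
# These values are made up for the example and are not results from the paper.
example = {"prompt_a": 50.0, "prompt_b": 50.5, "prompt_c": 50.25}

spread = prompt_variation(example)
is_robust = spread < 1.0  # the abstract reports variation below 1%
```

Under this reading, a benchmark is robust to prompt wording when `spread` stays under one percentage point for each evaluated model.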
