FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation

Abstract

Multimodal Large Language Models (MLLMs) have developed rapidly in recent years. However, the financial domain still lacks effective, specialized multimodal evaluation datasets. To advance MLLMs in finance, we introduce FinMME, which comprises more than 11,000 high-quality financial research samples spanning 18 financial domains and 6 asset classes, and covers 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. We also develop FinScore, an evaluation system that incorporates hallucination penalties and multi-dimensional capability assessment to provide unbiased evaluation. Extensive experiments show that even state-of-the-art models such as GPT-4o perform unsatisfactorily on FinMME, highlighting its challenging nature. The benchmark is also highly robust: prediction variation across different prompts remains below 1%, demonstrating superior reliability compared with existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
