MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
June 16, 2025
Authors: Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie
cs.AI
Abstract
Recent advances in large language models (LLMs) have accelerated progress in
financial NLP and applications, yet existing benchmarks remain limited to
monolingual and unimodal settings, often over-relying on simple tasks and
failing to reflect the complexity of real-world financial communication. We
introduce MultiFinBen, the first multilingual and multimodal benchmark tailored
to the global financial domain, evaluating LLMs across modalities (text,
vision, audio) and linguistic settings (monolingual, bilingual, multilingual)
on domain-specific tasks. We introduce two novel tasks, including PolyFiQA-Easy
and PolyFiQA-Expert, the first multilingual financial benchmarks requiring
models to perform complex reasoning over mixed-language inputs; and EnglishOCR
and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to
extract and reason over information from visual-text financial documents.
Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate
a compact, balanced benchmark rather than simply aggregating existing datasets.
Extensive evaluation of 22 state-of-the-art models reveals that even the
strongest models, despite their general multimodal and multilingual
capabilities, struggle dramatically when faced with complex cross-lingual and
multimodal tasks in the financial domain. MultiFinBen is publicly released to
foster transparent, reproducible, and inclusive progress in financial studies
and applications.
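
The abstract mentions a dynamic, difficulty-aware selection mechanism but leaves its details to the paper. The sketch below is purely illustrative and not the authors' implementation: it assumes a setup in which each item's difficulty is scored by the failure rate of a pool of reference models, and items are then sampled evenly across difficulty bins to form a compact, balanced subset. All function names, bin edges, and sample sizes here are hypothetical.

```python
# Hypothetical sketch of difficulty-aware selection (not the paper's actual method):
# score each item by how many reference models get it wrong, then draw a balanced
# sample from easy / medium / hard bins.
import random
from typing import Callable


def difficulty_score(item: dict, reference_models: list[Callable[[dict], str]]) -> float:
    """Fraction of reference models that answer the item incorrectly (0 = easy, 1 = hard)."""
    wrong = sum(1 for model in reference_models if model(item) != item["answer"])
    return wrong / len(reference_models)


def select_balanced_subset(items, reference_models, per_bin=100,
                           bins=(0.0, 0.34, 0.67, 1.01), seed=0):
    """Bucket items by difficulty score and sample up to per_bin items from each bucket."""
    rng = random.Random(seed)
    buckets = {i: [] for i in range(len(bins) - 1)}
    for item in items:
        score = difficulty_score(item, reference_models)
        for i in range(len(bins) - 1):
            if bins[i] <= score < bins[i + 1]:
                buckets[i].append(item)
                break
    subset = []
    for bucket in buckets.values():
        rng.shuffle(bucket)
        subset.extend(bucket[:per_bin])
    return subset
```

The benchmark's actual scoring and balancing criteria may differ; this sketch is only meant to make the idea of difficulty-aware selection concrete.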