MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
June 16, 2025
Authors: Xueqing Peng, Lingfei Qian, Yan Wang, Ruoyu Xiang, Yueru He, Yang Ren, Mingyang Jiang, Jeff Zhao, Huan He, Yi Han, Yun Feng, Yuechen Jiang, Yupeng Cao, Haohang Li, Yangyang Yu, Xiaoyu Wang, Penglei Gao, Shengyuan Lin, Keyi Wang, Shanshan Yang, Yilun Zhao, Zhiwei Liu, Peng Lu, Jerry Huang, Suyuchen Wang, Triantafillos Papadopoulos, Polydoros Giannouris, Efstathia Soufleri, Nuo Chen, Guojun Xiong, Zhiyang Deng, Yijia Zhao, Mingquan Lin, Meikang Qiu, Kaleb E Smith, Arman Cohan, Xiao-Yang Liu, Jimin Huang, Alejandro Lopez-Lira, Xi Chen, Junichi Tsujii, Jian-Yun Nie, Sophia Ananiadou, Qianqian Xie
cs.AI
Abstract
Recent advances in large language models (LLMs) have accelerated progress in
financial NLP and applications, yet existing benchmarks remain limited to
monolingual and unimodal settings, often over-relying on simple tasks and
failing to reflect the complexity of real-world financial communication. We
introduce MultiFinBen, the first multilingual and multimodal benchmark tailored
to the global financial domain, evaluating LLMs across modalities (text,
vision, audio) and linguistic settings (monolingual, bilingual, multilingual)
on domain-specific tasks. We contribute two pairs of novel tasks: PolyFiQA-Easy
and PolyFiQA-Expert, the first multilingual financial benchmarks requiring
models to perform complex reasoning over mixed-language inputs; and EnglishOCR
and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to
extract and reason over information from visual-text financial documents.
Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate
a compact, balanced benchmark rather than simply aggregating existing datasets.
Extensive evaluation of 22 state-of-the-art models reveals that even the
strongest models, despite their general multimodal and multilingual
capabilities, struggle dramatically when faced with complex cross-lingual and
multimodal tasks in the financial domain. MultiFinBen is publicly released to
foster transparent, reproducible, and inclusive progress in financial studies
and applications.
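As a rough illustration of what a difficulty-aware selection mechanism can look like, the sketch below buckets benchmark items by the fraction of reference models that answer them correctly and keeps a balanced subset per tier. The function name, tier thresholds, and input format are hypothetical assumptions for illustration only; the abstract does not specify the paper's actual procedure.

```python
# Hypothetical sketch (not the paper's actual method): score each item by
# the fraction of reference models that answer it correctly, bucket items
# into difficulty tiers, and keep a compact, balanced subset per tier
# instead of aggregating every source dataset wholesale.

def select_balanced(items, model_correct, per_tier):
    """items: list of item ids.
    model_correct: dict mapping item id -> list of 0/1 outcomes,
    one per reference model.
    per_tier: maximum number of items to keep in each tier."""
    tiers = {"easy": [], "medium": [], "hard": []}
    for item in items:
        outcomes = model_correct[item]
        acc = sum(outcomes) / len(outcomes)  # fraction of models correct
        if acc >= 0.7:        # most models solve it -> easy
            tiers["easy"].append(item)
        elif acc >= 0.3:      # some models solve it -> medium
            tiers["medium"].append(item)
        else:                 # few or no models solve it -> hard
            tiers["hard"].append(item)
    # truncate each tier so the final benchmark stays compact and balanced
    return {tier: ids[:per_tier] for tier, ids in tiers.items()}
```

In this sketch, harder items (those few models solve) are retained alongside easy ones in fixed proportions, so a small benchmark still spans the full difficulty range.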