M^3FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Abstract

Recent breakthroughs in large language models (LLMs) have led to new benchmarks for evaluating their performance in the financial domain. However, existing financial benchmarks often rely on news articles, earnings reports, or announcements, and so struggle to capture the real-world dynamics of financial meetings. To address this gap, we propose M^3FinMeeting, a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M^3FinMeeting supports English, Chinese, and Japanese, enabling evaluation of how well models understand financial discussions across diverse linguistic contexts. Second, it covers various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M^3FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, supporting a more realistic and comprehensive evaluation of understanding. Experiments with seven popular LLMs show that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M^3FinMeeting as a benchmark for assessing LLMs' comprehension of financial meetings.