M^3FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Abstract

Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called M^3FinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M^3FinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M^3FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M^3FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.