M^3FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Abstract

Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, which makes it difficult for them to capture the real-world dynamics of financial meetings. To address this gap, we propose M^3FinMeeting, a multilingual, multi-sector, and multi-task benchmark dataset designed for financial meeting understanding. First, M^3FinMeeting supports English, Chinese, and Japanese, enabling evaluation of how well financial discussions are understood in diverse linguistic contexts. Second, it covers the industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, M^3FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, allowing a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs show that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M^3FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.
