M^3FinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset

Abstract

Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, so they struggle to capture the real-world dynamics of financial meetings. To address this gap, we propose M^3FinMeeting, a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, M^3FinMeeting supports English, Chinese, and Japanese, enabling evaluation of how well models understand financial discussions in diverse linguistic contexts. Second, it covers the industry sectors defined by the Global Industry Classification Standard (GICS), so the benchmark spans a broad range of financial activities. Finally, M^3FinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, which together allow a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs show that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of M^3FinMeeting as a benchmark for assessing LLMs' financial meeting comprehension skills.