The FinBen: An Holistic Financial Benchmark for Large Language Models

Abstract

LLMs (large language models) have transformed NLP (natural language processing) and shown promise in various fields, yet their potential in finance remains underexplored because thorough evaluations are lacking and financial tasks are complex. This, along with the rapid development of LLMs, highlights the urgent need for a systematic financial evaluation benchmark for LLMs. In this paper, we introduce FinBen, the first comprehensive open-source evaluation benchmark specifically designed to thoroughly assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs' cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals insights into their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini excels in generation and forecasting. Both, however, struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts performance on simple tasks but falls short in improving complex reasoning and forecasting abilities. FinBen seeks to continuously evaluate LLMs in finance, fostering AI development with regular updates of tasks and models.
