The FinBen: An Holistic Financial Benchmark for Large Language Models

Abstract

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance remains underexplored because thorough evaluations are lacking and financial tasks are complex. This, along with the rapid development of LLMs, highlights the urgent need for a systematic financial evaluation benchmark for LLMs. In this paper, we introduce FinBen, the first comprehensive open-source evaluation benchmark specifically designed to thoroughly assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs' cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini leads in generation and forecasting; however, both struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts performance on simple tasks but falls short in improving complex reasoning and forecasting abilities. FinBen seeks to continuously evaluate LLMs in finance, fostering AI development through regular updates of tasks and models.
