The FinBen: An Holistic Financial Benchmark for Large Language Models

Abstract

Large language models (LLMs) have transformed natural language processing (NLP) and shown promise in various fields, yet their potential in finance remains underexplored because thorough evaluations are lacking and financial tasks are complex. This, along with the rapid development of LLMs, highlights the urgent need for a systematic financial evaluation benchmark for LLMs. In this paper, we introduce FinBen, the first comprehensive open-source evaluation benchmark specifically designed to thoroughly assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs' cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini shines in generation and forecasting; however, both struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts performance on simple tasks but falls short in improving complex reasoning and forecasting abilities. FinBen seeks to continuously evaluate LLMs in finance, fostering AI development with regular updates of tasks and models.
