FinBen: A Holistic Financial Benchmark for Large Language Models

Abstract

LLMs have transformed NLP and shown promise in various fields, yet their potential in finance remains underexplored, owing to a lack of thorough evaluations and the complexity of financial tasks. This, together with the rapid development of LLMs, highlights the urgent need for a systematic financial evaluation benchmark. In this paper, we introduce FinBen, the first comprehensive open-source evaluation benchmark designed specifically to assess the capabilities of LLMs in the financial domain. FinBen encompasses 35 datasets across 23 financial tasks, organized into three spectrums of difficulty inspired by the Cattell-Horn-Carroll theory, to evaluate LLMs' cognitive abilities in inductive reasoning, associative memory, quantitative reasoning, crystallized intelligence, and more. Our evaluation of 15 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals their strengths and limitations within the financial domain. The findings indicate that GPT-4 leads in quantification, extraction, numerical reasoning, and stock trading, while Gemini shines in generation and forecasting; however, both struggle with complex extraction and forecasting, showing a clear need for targeted enhancements. Instruction tuning boosts performance on simple tasks but falls short of improving complex reasoning and forecasting abilities. FinBen seeks to continuously evaluate LLMs in finance, fostering AI development through regular updates of tasks and models.
