SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

Abstract

Recent advances in large language models (LLMs) have demonstrated notable progress on many mathematical benchmarks. However, most of these benchmarks feature only problems grounded in junior and senior high school subjects, contain only multiple-choice questions, and are confined to a limited scope of elementary arithmetic operations. To address these issues, this paper introduces an expansive benchmark suite, SciBench, that aims to systematically examine the reasoning capabilities required for complex scientific problem solving. SciBench contains two carefully curated datasets: an open set featuring a range of collegiate-level scientific problems drawn from mathematics, chemistry, and physics textbooks, and a closed set comprising problems from undergraduate-level exams in computer science and mathematics. Based on the two datasets, we conduct an in-depth benchmark study of two representative LLMs with various prompting strategies. The results reveal that current LLMs fall short of delivering satisfactory performance, with an overall score of merely 35.80%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others, and that some strategies that improve certain problem-solving skills lead to declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
