CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning

Abstract

Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning and Information Extraction benchmark, to measure the potential of Large Language Models (LLMs) in scientific problem-solving and in assisting scientists in realistic workflows. The benchmark introduces ten challenging tasks with a total of 580 problem and solution pairs curated by experts in six disciplines (materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins), covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on the tasks in CURIE, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistently high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in the sciences. Evaluation code and data are available at https://github.com/google/curie.
