CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
March 14, 2025
Authors: Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan
cs.AI
Abstract
Scientific problem-solving involves synthesizing information while applying
expert knowledge. We introduce CURIE, a scientific long-Context
Understanding, Reasoning, and Information Extraction benchmark to measure
the potential of Large Language Models (LLMs) in scientific problem-solving
and in assisting scientists in realistic workflows. The benchmark introduces
ten challenging tasks with a total of 580 problem-and-solution pairs curated
by experts in six disciplines - materials science, condensed matter physics,
quantum computing, geospatial analysis, biodiversity, and proteins - covering
both experimental and theoretical workflows in science. We evaluate a range
of closed and open LLMs on tasks in CURIE, which require domain expertise,
comprehension of long in-context information, and multi-step reasoning. While
Gemini Flash 2.0 and Claude-3 show consistently high comprehension across
domains, the popular GPT-4o and command-R+ fail dramatically on protein
sequencing tasks. With the best performance at 32%, there is much room for
improvement for all models. We hope that insights gained from CURIE can guide
the future development of LLMs in the sciences. Evaluation code and data are
available at https://github.com/google/curie.
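
The abstract does not describe the repository's evaluation interface, so the
following is a minimal, hypothetical sketch of how a CURIE-style task might be
scored, assuming tasks ship as JSON lists of problem-and-solution records. The
field names ("context", "question", "solution"), the file name, and the
query_llm callback are illustrative assumptions, not the actual google/curie
API, and CURIE's real scoring is task-specific rather than exact string match.

import json
from typing import Callable

def evaluate_task(path: str, query_llm: Callable[[str], str]) -> float:
    """Score a model on one CURIE-style task file.

    Assumes each record carries 'context' (the long scientific source
    document), 'question', and 'solution' fields; this schema is a
    guess, not the repository's documented format.
    """
    with open(path) as f:
        records = json.load(f)

    correct = 0
    for rec in records:
        # Long-context setup: the full source document is placed
        # in the prompt ahead of the task question.
        prompt = f"{rec['context']}\n\nQuestion: {rec['question']}"
        answer = query_llm(prompt)
        # Stand-in grader: exact string match. The benchmark's actual
        # graders are task-specific (e.g. programmatic or model-assisted).
        correct += int(answer.strip() == rec["solution"].strip())

    return correct / len(records)

# Usage (hypothetical file name and model client):
# accuracy = evaluate_task("protein_sequencing.json", my_model.generate)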