

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

June 12, 2024
作者: Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu
cs.AI

Abstract

Computer Science (CS) stands as a testament to the intricacies of human intelligence, profoundly advancing the development of artificial intelligence and modern society. However, the current community of large language models (LLMs) overly focuses on benchmarks for analyzing specific foundational skills (e.g., mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first bilingual (Chinese-English) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 5K meticulously curated test samples, covering 26 subfields across 4 key areas of computer science, encompassing various task forms and divisions of knowledge and reasoning. Utilizing CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scale. We also quantitatively analyze the reasons for failures in existing LLMs and highlight directions for improvement, including knowledge supplementation and CS-specific reasoning. Further cross-capability experiments show a high correlation between LLMs' capabilities in computer science and their abilities in mathematics and coding. Moreover, expert LLMs specialized in mathematics and coding also demonstrate strong performance in several CS subfields. Looking ahead, we envision CS-Bench serving as a cornerstone for LLM applications in the CS field and paving new avenues in assessing LLMs' diverse reasoning capabilities. The CS-Bench data and evaluation code are available at https://github.com/csbench/csbench.
