CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Abstract

Computer Science (CS) stands as a testament to the intricacies of human intelligence and has profoundly advanced the development of artificial intelligence and modern society. However, the current large language model (LLM) community focuses too heavily on benchmarks that analyze specific foundational skills (e.g., mathematics and code generation), neglecting an all-round evaluation of the computer science field. To bridge this gap, we introduce CS-Bench, the first bilingual (Chinese-English) benchmark dedicated to evaluating the performance of LLMs in computer science. CS-Bench comprises approximately 5,000 meticulously curated test samples covering 26 subfields across 4 key areas of computer science, and it encompasses various task forms as well as a division between knowledge and reasoning. Using CS-Bench, we conduct a comprehensive evaluation of over 30 mainstream LLMs, revealing the relationship between CS performance and model scale. We also quantitatively analyze the reasons for failures in existing LLMs and highlight directions for improvement, including knowledge supplementation and CS-specific reasoning. Further cross-capability experiments show a high correlation between LLMs' capabilities in computer science and their abilities in mathematics and coding. Moreover, expert LLMs specialized in mathematics and coding also demonstrate strong performance in several CS subfields. Looking ahead, we envision CS-Bench serving as a cornerstone for LLM applications in the CS field and paving new avenues for assessing LLMs' diverse reasoning capabilities. The CS-Bench data and evaluation code are available at https://github.com/csbench/csbench.
