InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

Abstract

Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with a realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieves more than 45% overall accuracy, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.
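The seed-to-repository idea described in the abstract can be illustrated with a minimal sketch. This is not the InfiniteScienceGym implementation: the function names (generate_repository, generate_questions), the file layout, and the column names are all hypothetical, chosen only to show how a single seed can deterministically yield both a small repository and questions with exact ground truth, including one that cannot be answered from the data.

```python
import csv
import io
import random
import statistics


def generate_repository(seed):
    """Hypothetical toy simulator: map a seed to {relative_path: file_text}."""
    rng = random.Random(seed)  # local RNG, so the output depends only on the seed
    n_rows = rng.randint(5, 10)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample_id", "temperature_c"])
    for i in range(n_rows):
        writer.writerow([i, round(rng.uniform(15.0, 30.0), 2)])
    return {
        "README.md": f"Toy study generated from seed {seed}.\n",
        "data/measurements.csv": buf.getvalue(),
    }


def generate_questions(repo):
    """Hypothetical QA generator: derive exact answers from the generated table.

    Simplification: a privileged generator would read the simulator's internal
    state; here the answers are recomputed from the generated CSV instead.
    """
    rows = list(csv.DictReader(io.StringIO(repo["data/measurements.csv"])))
    temps = [float(r["temperature_c"]) for r in rows]
    return [
        {
            "question": "What is the mean temperature_c in data/measurements.csv?",
            "answerable": True,
            "answer": round(statistics.mean(temps), 2),
        },
        {
            # No humidity column exists, so the correct behavior is to abstain.
            "question": "What is the mean humidity in data/measurements.csv?",
            "answerable": False,
            "answer": None,
        },
    ]


if __name__ == "__main__":
    repo = generate_repository(42)
    for item in generate_questions(repo):
        print(item)
```

Because the repository is a pure function of the seed, only the seed and the generator code need to be shared to reproduce an evaluation instance, which is the property that removes the need to distribute a large static corpus.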
