

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

April 14, 2026
Authors: Oliver Bentham, Vivek Srikumar
cs.AI

Abstract

Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.
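The abstract's core mechanism is seeded, deterministic generation: a single integer seed fully determines a repository's files and tabular data, and a privileged generator that sees the raw data can emit exact ground truth, including deliberately unanswerable questions. The sketch below illustrates that pattern in miniature; all file names, column names, and question templates are hypothetical, not taken from the paper.

```python
import csv
import io
import random


def generate_repository(seed: int) -> dict:
    """Hypothetical sketch: deterministically derive a tiny 'scientific
    repository' (mapping of paths to file contents) from a seed. Using a
    seeded random.Random instance means the same seed always reproduces
    the same repository, so no static corpus needs to be distributed."""
    rng = random.Random(seed)
    n_rows = rng.randint(5, 20)

    # Tabular data: one CSV of simulated measurements (illustrative columns).
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sample_id", "temperature_c", "yield_pct"])
    for i in range(n_rows):
        writer.writerow([i, round(rng.uniform(20, 90), 2),
                         round(rng.uniform(0, 100), 2)])

    return {
        "README.md": f"# Experiment {rng.randint(1000, 9999)}\n",
        "data/measurements.csv": buf.getvalue(),
    }


def generate_qa(repo: dict) -> dict:
    """Privileged QA sketch: because the generator sees the raw data, it
    can attach an exact answer (here, the row count) to an answerable
    question, and pair it with a question about a quantity the data
    simply does not contain, whose correct response is abstention."""
    rows = repo["data/measurements.csv"].strip().splitlines()
    n_samples = len(rows) - 1  # exclude the header row
    return {
        "answerable": ("How many samples were measured?", n_samples),
        "unanswerable": ("What was the mean pressure in kPa?", None),
    }
```

A model under evaluation would see only the repository files and the question text; the exact answer (or the `None` marking an unanswerable question) stays on the grader's side, which is what makes the task verifiable.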