InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
April 14, 2026
Authors: Oliver Bentham, Vivek Srikumar
cs.AI
Abstract
Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.
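To make the central mechanism concrete, the sketch below illustrates what seed-deterministic generation of a tiny repository with an exact, privileged QA pair could look like. This is only an assumption-laden illustration: the function, file, and column names (generate_instance, data/measurements.csv, sample_id, measurement) are hypothetical and are not taken from the InfiniteScienceGym implementation.

```python
# Minimal sketch of seed-deterministic repository + QA generation.
# All names here are hypothetical illustrations, not the InfiniteScienceGym API.
import csv
import io
import random


def generate_instance(seed: int):
    """Deterministically build a tiny repository (path -> text) and one QA pair.

    The same seed always yields byte-identical files and the same exact answer,
    so instances can be regenerated on demand instead of shipping a static corpus.
    """
    rng = random.Random(seed)
    n_rows = rng.randint(20, 50)
    rows = [(i, round(rng.gauss(10.0, 2.0), 3)) for i in range(n_rows)]

    # Serialize the tabular data as a CSV file inside the synthetic repository.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sample_id", "measurement"])
    writer.writerows(rows)
    repo = {
        "README.md": "# Synthetic study\nMeasurements live in data/measurements.csv\n",
        "data/measurements.csv": buf.getvalue(),
    }

    # The QA generator is "privileged": it sees the raw rows, so the ground-truth
    # answer is exact rather than annotated after the fact.
    mean = sum(value for _, value in rows) / len(rows)
    qa = {
        "question": "What is the mean of the 'measurement' column in data/measurements.csv?",
        "answer": round(mean, 3),
        "answerable": True,
    }
    return repo, qa


if __name__ == "__main__":
    repo_a, qa_a = generate_instance(seed=7)
    repo_b, qa_b = generate_instance(seed=7)
    assert repo_a == repo_b and qa_a == qa_b  # determinism: same seed, same instance
    print(qa_a["question"], "->", qa_a["answer"])
```

A real generator in the paper's spirit would also emit unanswerable questions (for example, asking about a column or file the repository does not contain) with an abstention label, so that models are scored on recognizing missing evidence as well as on computing correct answers.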