ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

November 18, 2025
Authors: Hongwei Liu, Junnan Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, Guangtao Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, Siqi Sun, Jiaye Ge, Meng Li, Yuhong Liu, Jianxin Dong, Jiaying Li, Hui Wu, Hanwen Liang, Jintai Lin, Yanting Wang, Jie Dong, Tong Zhu, Tianfan Fu, Conghui He, Qi Zhang, Songyang Zhang, Lei Bai, Kai Chen
cs.AI

Abstract

The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, calling into question their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, and cross-disciplinary evaluation suite composed of approximately 800 original problems. Developed by domain experts (PhD-level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance, with all questions newly created or substantially adapted to prevent test data leakage; (2) Cross-Disciplinary Focus, designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers, prioritizing complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions over simple multiple-choice questions; and (4) Rigorous Quality Control, employing a multi-stage process of expert peer review and adversarial testing to ensure question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm using a panel of LLM judges for automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating their advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform to provide a reliable "ruler" for progress toward Artificial General Intelligence.
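The abstract does not detail how the panel of LLM judges reaches a verdict. Below is a minimal sketch, assuming a strict majority vote over independent judge models and a binary CORRECT/INCORRECT grading protocol; the `panel_verdict` function, its prompt, and the stub judges are hypothetical illustrations, not the paper's published pipeline.

```python
# A minimal sketch of a panel-of-LLM-judges grader, assuming majority voting
# over independent judges and a binary CORRECT/INCORRECT reply protocol.
# Everything named here is a hypothetical illustration of the idea.
from collections import Counter
from typing import Callable, Sequence

# A judge is any callable mapping a grading prompt to a model reply,
# e.g. a thin wrapper around an LLM API client.
Judge = Callable[[str], str]

def panel_verdict(
    question: str,
    reference_answer: str,
    candidate_answer: str,
    judges: Sequence[Judge],
) -> bool:
    """Return True if a strict majority of judges accepts the answer."""
    prompt = (
        "You are grading a response to a scientific benchmark problem.\n\n"
        f"Problem:\n{question}\n\n"
        f"Reference answer:\n{reference_answer}\n\n"
        f"Candidate answer:\n{candidate_answer}\n\n"
        "Decide whether the candidate is scientifically equivalent to the "
        "reference. Compare LaTeX expressions by meaning, not surface form "
        "(e.g. \\frac{1}{2} equals 0.5). Reply with exactly one word: "
        "CORRECT or INCORRECT."
    )
    votes = Counter(judge(prompt).strip().upper() for judge in judges)
    return votes["CORRECT"] > len(judges) // 2

# Usage with stub judges standing in for real LLM calls:
if __name__ == "__main__":
    stub_yes: Judge = lambda _prompt: "CORRECT"
    stub_no: Judge = lambda _prompt: "INCORRECT"
    print(panel_verdict("Compute 1/2.", "0.5", r"\frac{1}{2}",
                        [stub_yes, stub_yes, stub_no]))  # True (2 of 3 accept)
```

Aggregating several judges this way hedges against the bias and variance of any single grader, which matters when answers are open-ended, multi-step derivations rather than multiple-choice letters.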