ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

May 25, 2026
Authors: Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister
cs.AI

Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures that surface-level evaluation cannot detect: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, we propose Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, we develop ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, we introduce CoE Audit, a post-hoc audit whose four integrity checks (score verification, specification violation, reference verification, and method-code alignment) apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
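
The abstract reports the four integrity checks as per-system counts and rates (for example 0/337 hallucinated references, 12/12 score verifications). The following is a minimal sketch of how such per-paper audit outcomes could be aggregated into those rates. It is not the authors' implementation: the `PaperAudit` record, its field names, the `ratio` and `summarize` helpers, and the convention that `score_verified=None` marks a paper with no checkable score are all assumptions made for illustration, and the sample records are toy values, not data from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperAudit:
    """Hypothetical per-paper audit record; all field names are illustrative."""
    system: str                      # which research system wrote the paper
    refs_total: int                  # references cited in the paper
    refs_hallucinated: int           # references with no matching real source
    score_verified: Optional[bool]   # None: no score could be checked (assumption)
    spec_violation: bool             # the solution broke the task specification
    method_code_aligned: bool        # the written method matches the code


def ratio(num, den):
    """Return num/den, or None when there is nothing to divide by."""
    return num / den if den else None


def summarize(audits):
    """Aggregate per-paper records into per-system rates for the four checks."""
    by_system = defaultdict(list)
    for audit in audits:
        by_system[audit.system].append(audit)

    summary = {}
    for system, rows in by_system.items():
        # Only papers with a checkable score count toward score verification.
        scored = [r for r in rows if r.score_verified is not None]
        summary[system] = {
            "papers": len(rows),
            "hallucinated_ref_rate": ratio(
                sum(r.refs_hallucinated for r in rows),
                sum(r.refs_total for r in rows),
            ),
            "score_verification_rate": ratio(
                sum(r.score_verified for r in scored), len(scored)
            ),
            "spec_violation_rate": ratio(
                sum(r.spec_violation for r in rows), len(rows)
            ),
            "method_code_alignment_rate": ratio(
                sum(r.method_code_aligned for r in rows), len(rows)
            ),
        }
    return summary


if __name__ == "__main__":
    # Toy records for a made-up system; these are not results from the paper.
    toy = [
        PaperAudit("toy-agent", 10, 2, True, False, True),
        PaperAudit("toy-agent", 10, 0, None, True, False),
    ]
    for name, stats in summarize(toy).items():
        print(name, stats)
```

Pooling references across papers before dividing, rather than averaging per-paper rates, matches the way the abstract states the reference result as a single fraction over all references, though the paper may compute its other rates differently.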