ScientistOne: Toward Human-Level Autonomous Research via Chain-of-Evidence

Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures that surface-level evaluation cannot detect: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address these failures through three contributions. First, we propose Chain-of-Evidence (CoE), a verifiability framework that requires every claim to be traceable to its evidence source. Second, we present ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, we introduce CoE Audit, a post-hoc audit whose four integrity checks (score verification, specification violation, reference verification, and method-code alignment) apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
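
To make the traceability requirement concrete, the following is a minimal illustrative sketch, not the ScientistOne or CoE Audit implementation. It assumes a claim can be represented as a text string plus a list of evidence identifiers, and that a reference counts as hallucinated when it is absent from a set of independently verified entries. All names (`Claim`, `untraceable_claims`, `hallucinated_reference_rate`) and the identifier format are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A single statement in a paper (hypothetical representation)."""
    text: str
    # Identifiers of the evidence backing the claim, e.g. a run log or a citation key.
    evidence: list[str] = field(default_factory=list)


def untraceable_claims(claims, known_evidence):
    """Return claims that cite no evidence, or cite evidence outside the known set."""
    return [
        c for c in claims
        if not c.evidence or any(e not in known_evidence for e in c.evidence)
    ]


def hallucinated_reference_rate(cited, verified):
    """Fraction of cited references that do not appear in the verified set."""
    if not cited:
        return 0.0
    missing = [r for r in cited if r not in verified]
    return len(missing) / len(cited)


# Toy data, invented purely for illustration.
claims = [
    Claim("Our model reaches the reported score.", ["log:run-1"]),
    Claim("Prior work uses a similar loss.", ["ref:unknown"]),
    Claim("The method is efficient."),
]
flagged = untraceable_claims(claims, known_evidence={"log:run-1"})
rate = hallucinated_reference_rate(["r1", "r2", "r3", "r4"], {"r1", "r2", "r3"})
```

Under these assumptions, the second and third toy claims are flagged, since one points to an unknown source and the other cites nothing. The reference rate is simply unverified citations divided by total citations, which is the form in which the abstract reports 0/337.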