ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

May 25, 2026
Authors: Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister
cs.AI

Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures that surface-level evaluation cannot detect: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, we introduce Chain-of-Evidence (CoE), a verifiability framework that requires every claim to be traceable to its evidence source. Second, we present ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, we propose CoE Audit, a post-hoc audit whose four integrity checks (score verification, specification violation, reference verification, and method-code alignment) apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art results on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
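
To make the traceability requirement and the reported ratios concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Claim` schema, the `untraceable_claims` and `rate` helpers, the example claims, and the `runs/eval.log` path are all hypothetical names invented for illustration. Only the three ratios (0/337, 12/12, 14/15) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """A statement in a paper plus the evidence it points to (hypothetical schema)."""
    text: str
    # Evidence identifiers such as log paths or DOIs; illustrative only.
    evidence: List[str] = field(default_factory=list)


def untraceable_claims(claims: List[Claim]) -> List[Claim]:
    """Return the claims that point to no evidence source at all."""
    return [c for c in claims if not c.evidence]


def rate(count: int, total: int) -> float:
    """Fraction count/total, as used for ratios such as 14/15."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Made-up claims, assumed for this example only.
claims = [
    Claim("Validation accuracy is reported in Table 1", ["runs/eval.log"]),
    Claim("The method uses a cosine schedule"),  # no evidence attached
]
flagged = untraceable_claims(claims)

# Ratios quoted in the abstract for ScientistOne.
hallucinated_rate = rate(0, 337)  # hallucinated references
score_pass_rate = rate(12, 12)    # score verification
alignment_rate = rate(14, 15)     # method-code alignment
```

In this sketch, `flagged` holds only the second claim, since it carries no evidence identifier. Expressed as percentages, the abstract's ratios are 0% hallucinated references, 100% score verification, and roughly 93.3% method-code alignment.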