ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures that surface-level evaluation cannot detect: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit that applies four integrity checks (score verification, specification violation, reference verification, and method-code alignment) uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art results on Parameter Golf and winning gold medals on MLE-Bench tasks where baselines fail entirely.
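
To make the traceability requirement concrete, the following is a minimal sketch of how a claim-to-evidence record and a simple post-hoc check might look. It is an illustration under stated assumptions, not the ScientistOne implementation or the CoE Audit procedure: the names `Evidence`, `Claim`, `untraceable_claims`, and `traceable_rate`, the evidence kinds, and all locator strings are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Evidence:
    """A pointer from a claim to its source (hypothetical schema)."""
    kind: str     # assumed categories: "reference", "score_log", "code"
    locator: str  # assumed identifier, e.g. a citation key or a file path


@dataclass
class Claim:
    """A statement in a paper together with the evidence it rests on."""
    text: str
    evidence: List[Evidence] = field(default_factory=list)


def untraceable_claims(claims: List[Claim], known_sources: Set[str]) -> List[Claim]:
    """Return claims that cite no evidence, or cite a locator that is not
    among the sources that could actually be resolved."""
    bad = []
    for claim in claims:
        if not claim.evidence or any(
            e.locator not in known_sources for e in claim.evidence
        ):
            bad.append(claim)
    return bad


def traceable_rate(claims: List[Claim], known_sources: Set[str]) -> float:
    """Fraction of claims whose evidence all resolves (1.0 for no claims)."""
    if not claims:
        return 1.0
    bad = untraceable_claims(claims, known_sources)
    return (len(claims) - len(bad)) / len(claims)


# Illustrative data only: every locator below is made up for this sketch.
known_sources = {"ref:example2020", "logs/eval_run.json", "src/model.py"}
claims = [
    Claim("Prior work uses technique A.", [Evidence("reference", "ref:example2020")]),
    Claim("Our score on the task is reproducible.", [Evidence("score_log", "logs/eval_run.json")]),
    Claim("Technique B was proposed earlier.", [Evidence("reference", "ref:does_not_exist")]),
]

bad = untraceable_claims(claims, known_sources)
rate = traceable_rate(claims, known_sources)
print(f"{len(bad)} untraceable claim(s); traceable rate = {rate:.2f}")
```

In this toy example the third claim is flagged because its reference locator cannot be resolved, which mirrors the fabricated-citation failure mode described above. A real audit would also need to verify that the resolved source actually supports the claim, which this sketch does not attempt.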