ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Abstract

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures that surface-level evaluation cannot detect: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this problem through three contributions. First, we propose Chain-of-Evidence (CoE), a verifiability framework that requires every claim to be traceable to its evidence source. Second, we present ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, we introduce CoE Audit, a post-hoc audit whose four integrity checks (score verification, specification violation, reference verification, and method-code alignment) apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. In contrast, ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art performance on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
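
The four integrity checks named above could be recorded per paper and aggregated into count/total pairs roughly as follows. This is a minimal sketch under stated assumptions, not the authors' audit tool: `PaperAudit`, `summarize`, and `rate` are hypothetical names, the toy records are invented for illustration and are not results from the paper, and treating score verification as optional (so that its denominator may be smaller than the number of papers) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class PaperAudit:
    """Hypothetical per-paper record of the four integrity checks."""
    system: str
    refs_total: int                  # references cited in the paper
    refs_hallucinated: int           # references that could not be verified
    score_verified: Optional[bool]   # None if no score was checkable (assumption)
    spec_violation: bool             # True if the task specification was violated
    method_code_aligned: bool        # True if the described method matches the code


def summarize(audits: Iterable[PaperAudit]) -> dict:
    """Aggregate records into (count, total) pairs, one pair per check."""
    audits = list(audits)
    scored = [a for a in audits if a.score_verified is not None]
    return {
        "hallucinated_refs": (sum(a.refs_hallucinated for a in audits),
                              sum(a.refs_total for a in audits)),
        "score_verified": (sum(a.score_verified for a in scored), len(scored)),
        "spec_violations": (sum(a.spec_violation for a in audits), len(audits)),
        "method_code_aligned": (sum(a.method_code_aligned for a in audits),
                                len(audits)),
    }


def rate(pair: Tuple[int, int]) -> Optional[float]:
    """Convert a (count, total) pair to a fraction; None if total is zero."""
    count, total = pair
    return count / total if total else None


# Invented toy records, for illustration only.
toy = [
    PaperAudit("demo", refs_total=20, refs_hallucinated=0,
               score_verified=True, spec_violation=False,
               method_code_aligned=True),
    PaperAudit("demo", refs_total=30, refs_hallucinated=3,
               score_verified=None, spec_violation=False,
               method_code_aligned=False),
]
summary = summarize(toy)
```

Keeping counts and totals as pairs, rather than only percentages, mirrors the way the abstract reports figures such as 0/337 and 14/15 and lets each check carry its own denominator.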