AutoResearch AI: Toward AI-Powered Research Automation for Scientific Discovery

Abstract

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experimentation, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented: they differ in autonomy, domain scope, execution environment, validation mechanism, and human oversight, and they still struggle with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within that spectrum, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop without achieving robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows, and we organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions (novelty, validity, impact, reliability, and provenance) and show that AutoResearch autonomy is domain-conditioned: it is more credible in structured, executable, and rapidly verifiable settings but limited in embodied, delayed, heterogeneous, ethical, or institutionally accountable contexts.