AI-Based Automated Research: Roadmap and User Guide

Abstract

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Surveying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding and experiments, tables and figures), Writing (paper writing), Validation (peer review, rebuttal and revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, a benchmark suite, a tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.