AutoResearch AI: Toward AI-Driven Research Automation to Empower Scientific Discovery

Abstract

Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literature grounding, hypothesis generation, experiment design, validation, reporting, and revision. This shift marks a transition from task-level AI for science to workflow-level research automation. Yet current systems remain fragmented: they differ in autonomy, domain scope, execution environment, validation mechanism, and mode of human oversight, and they still struggle with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure. This survey examines these developments through the concept of AutoResearch, defined as the developmental spectrum of AI-powered scientific workflow automation. Within this spectrum, Vibe Research denotes the human-steered region of prompt-based assistance and human-verified execution, whereas emerging AI-led systems coordinate larger portions of the discovery loop but have not yet achieved robust autonomy. We analyze how research systems redistribute control, evidence, execution, validation, and accountability across workflows, and we organize the field around five workflow conditions: literature and research grounding; hypothesis formation and planning; experimentation and tool use; feedback, validation, and review; and reporting and knowledge communication. We further synthesize AI scientist systems, mixed-initiative co-research frameworks, benchmarks, domain deployments, and open-source infrastructures. Finally, we propose five evaluation dimensions (novelty, validity, impact, reliability, and provenance) and argue that AutoResearch autonomy is domain-conditioned: it is more credible in structured, executable, and rapidly verifiable settings, but remains limited in contexts that are embodied, long-horizon, heterogeneous, ethically sensitive, or subject to institutional accountability.