Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts
January 6, 2026
Authors: Dhruv Trehan, Paras Chopra
cs.AI
Abstract
We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Three of the four attempts failed during implementation or evaluation; the fourth completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training-data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems and their implications for autonomous scientific discovery. We release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai-scientist-artefacts-v1.