Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts
January 6, 2026
Authors: Dhruv Trehan, Paras Chopra
cs.AI
Abstract
We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Three of the four attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training-data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems and implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai-scientist-artefacts-v1.
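To make the pipeline architecture concrete, the minimal Python sketch below shows one way six LLM agents could be chained over scientific-workflow stages. It is an illustration only, not the authors' implementation: the stage names, the `call_llm` helper, and the naive context-accumulation strategy are all hypothetical assumptions (the abstract says only that six agents were mapped to workflow stages).

```python
# Minimal sketch (not the paper's code): a linear pipeline of six LLM agents,
# each mapped to a stage of the scientific workflow. Stage names and the
# `call_llm` helper are hypothetical placeholders.

from dataclasses import dataclass, field

# Hypothetical stage names; the abstract states only "six LLM agents".
STAGES = [
    "ideation", "literature_review", "experiment_design",
    "implementation", "evaluation", "paper_writing",
]

def call_llm(role: str, context: str) -> str:
    """Placeholder for a real LLM API call; swap in an actual client here."""
    return f"[{role} output given {len(context)} chars of context]"

@dataclass
class PipelineState:
    artifacts: dict = field(default_factory=dict)  # stage name -> output

def run_pipeline(research_goal: str) -> PipelineState:
    state = PipelineState()
    context = research_goal
    for stage in STAGES:
        output = call_llm(stage, context)
        state.artifacts[stage] = output
        # Naive context accumulation: each stage sees everything so far.
        # Real systems need long-horizon memory management, since context
        # degradation is one of the failure modes the paper reports.
        context += "\n" + output
    return state

if __name__ == "__main__":
    result = run_pipeline("Study context degradation in long-horizon agents")
    for stage, artifact in result.artifacts.items():
        print(stage, "->", artifact)
```

In this toy design each stage's output is appended to a shared context, which makes the long-horizon memory and implementation-drift failure modes easy to see: errors introduced early propagate unchecked into every later stage.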