Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Abstract

Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders, and we perform the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and to blindly review both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and a lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even for experts, and propose an end-to-end study design that recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcomes.