AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems
October 6, 2025
Authors: Shambhavi Mishra, Gaurav Sahu, Marco Pedersoli, Laurent Charlin, Jose Dolz, Christopher Pal
cs.AI
Abstract
Large language models (LLMs) demonstrate impressive capabilities across a
wide range of tasks, yet it remains unclear whether such success reflects
genuine reasoning or sophisticated recall. We introduce AInstein, a framework
for testing whether LLMs can generate valid solutions to AI research problems
using only their pretrained parametric knowledge -- without domain-specific
fine-tuning, retrieval augmentation, or other external aids. Our approach
extracts distilled problem statements from high-quality ICLR 2025 submissions,
then tasks specialized solver agents with proposing and refining technical
solutions through iterative critique loops, mimicking the cycles of proposal,
review, and revision central to scientific inquiry. We evaluate AInstein on
1,214 ICLR papers stratified by acceptance tier (Oral, Spotlight, Poster),
using an LLM-as-a-judge paradigm guided by a structured rubric, complemented by
targeted manual checks. Performance is assessed with three metrics: Success
Rate (does the solution address the problem?), Rediscovery (does it align with
human-proposed methods?), and Novelty (does it yield valid, original
approaches?). Our results reveal that while LLMs can rediscover feasible
solutions and occasionally propose creative alternatives, their problem-solving
ability remains fragile and highly sensitive to framing. These findings provide
the first large-scale evidence on the extent to which LLMs can act as
autonomous scientific problem-solvers, highlighting both their latent potential
and their current limitations.
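
The abstract describes solver agents that propose and refine solutions through iterative critique loops. The sketch below illustrates one plausible shape of such a loop; the function name solve, the ask_llm callable, and the stopping rule are hypothetical illustrations under stated assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a propose -> review -> revise loop, assuming a generic
# `ask_llm` callable (prompt in, text out). All names here are hypothetical.
from typing import Callable

def solve(problem: str, ask_llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Iteratively draft and revise a technical solution to a distilled
    problem statement, mimicking cycles of proposal, review, and revision."""
    solution = ask_llm(f"Propose a technical solution to:\n{problem}")
    for _ in range(max_rounds):
        critique = ask_llm(
            "Act as a rigorous reviewer. List concrete weaknesses in this "
            f"solution.\nProblem: {problem}\nSolution: {solution}"
        )
        # Naive stopping rule for illustration only; the paper's critique
        # loop would use a structured rubric rather than string matching.
        if "no major weaknesses" in critique.lower():
            break
        solution = ask_llm(
            "Revise the solution to address the critique.\n"
            f"Problem: {problem}\nSolution: {solution}\nCritique: {critique}"
        )
    return solution
```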
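The evaluation reports three metrics (Success Rate, Rediscovery, Novelty) stratified by acceptance tier. A minimal sketch of that aggregation follows, assuming per-paper boolean judge verdicts; the Verdict fields and verdict format are assumptions for illustration, since the paper's rubric is richer than a single boolean per axis.

```python
# Aggregate per-paper judge verdicts into per-tier rates. The Verdict
# schema below is an assumed simplification of the paper's rubric output.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Verdict:
    tier: str          # "Oral", "Spotlight", or "Poster"
    success: bool      # does the solution address the problem?
    rediscovery: bool  # does it align with the human-proposed method?
    novelty: bool      # is it a valid, original alternative?

def tier_metrics(verdicts: list[Verdict]) -> dict[str, dict[str, float]]:
    by_tier: dict[str, list[Verdict]] = defaultdict(list)
    for v in verdicts:
        by_tier[v.tier].append(v)
    return {
        tier: {
            "success_rate": sum(v.success for v in vs) / len(vs),
            "rediscovery": sum(v.rediscovery for v in vs) / len(vs),
            "novelty": sum(v.novelty for v in vs) / len(vs),
        }
        for tier, vs in by_tier.items()
    }
```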