

AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems

October 6, 2025
作者: Shambhavi Mishra, Gaurav Sahu, Marco Pedersoli, Laurent Charlin, Jose Dolz, Christopher Pal
cs.AI

Abstract

Large language models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet it remains unclear whether such success reflects genuine reasoning or sophisticated recall. We introduce AInstein, a framework for testing whether LLMs can generate valid solutions to AI research problems using only their pretrained parametric knowledge -- without domain-specific fine-tuning, retrieval augmentation, or other external aids. Our approach extracts distilled problem statements from high-quality ICLR 2025 submissions, then tasks specialized solver agents with proposing and refining technical solutions through iterative critique loops, mimicking the cycles of proposal, review, and revision central to scientific inquiry. We evaluate AInstein on 1,214 ICLR papers stratified by acceptance tier (Oral, Spotlight, Poster), using an LLM-as-a-judge paradigm guided by a structured rubric, complemented by targeted manual checks. Performance is assessed with three metrics: Success Rate (does the solution address the problem?), Rediscovery (does it align with human-proposed methods?), and Novelty (does it yield valid, original approaches?). Our results reveal that while LLMs can rediscover feasible solutions and occasionally propose creative alternatives, their problem-solving ability remains fragile and highly sensitive to framing. These findings provide the first large-scale evidence on the extent to which LLMs can act as autonomous scientific problem-solvers, highlighting both their latent potential and their current limitations.
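The abstract describes a propose-critique-refine cycle in which solver agents iteratively revise technical solutions. A minimal sketch of such a loop is shown below; the function names, stub logic, and stopping criterion are illustrative assumptions, not the paper's actual implementation (which would back each step with LLM calls):

```python
# Hypothetical sketch of an AInstein-style propose -> critique -> refine loop.
# The stubs below stand in for LLM calls; names and logic are assumptions.

def propose_solution(problem: str) -> str:
    """Stub for a solver agent's initial proposal (would call an LLM)."""
    return f"Initial approach to: {problem}"

def critique(solution: str) -> list[str]:
    """Stub for a critic pass; returns remaining issues (would call an LLM)."""
    return [] if "refined" in solution else ["lacks technical detail"]

def refine(solution: str, issues: list[str]) -> str:
    """Stub that revises a proposal to address the critique."""
    return solution + " [refined: " + "; ".join(issues) + "]"

def solve(problem: str, max_rounds: int = 3) -> str:
    """Iterate critique and refinement until no issues remain or rounds run out."""
    solution = propose_solution(problem)
    for _ in range(max_rounds):
        issues = critique(solution)
        if not issues:
            break
        solution = refine(solution, issues)
    return solution
```

In the paper's setting, the final solution would then be scored by an LLM-as-a-judge against a structured rubric for Success Rate, Rediscovery, and Novelty.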