

AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems

October 6, 2025
Authors: Shambhavi Mishra, Gaurav Sahu, Marco Pedersoli, Laurent Charlin, Jose Dolz, Christopher Pal
cs.AI

Abstract

Large language models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet it remains unclear whether such success reflects genuine reasoning or sophisticated recall. We introduce AInstein, a framework for testing whether LLMs can generate valid solutions to AI research problems using only their pretrained parametric knowledge -- without domain-specific fine-tuning, retrieval augmentation, or other external aids. Our approach extracts distilled problem statements from high-quality ICLR 2025 submissions, then tasks specialized solver agents with proposing and refining technical solutions through iterative critique loops, mimicking the cycles of proposal, review, and revision central to scientific inquiry. We evaluate AInstein on 1,214 ICLR papers stratified by acceptance tier (Oral, Spotlight, Poster), using an LLM-as-a-judge paradigm guided by a structured rubric, complemented by targeted manual checks. Performance is assessed with three metrics: Success Rate (does the solution address the problem?), Rediscovery (does it align with human-proposed methods?), and Novelty (does it yield valid, original approaches?). Our results reveal that while LLMs can rediscover feasible solutions and occasionally propose creative alternatives, their problem-solving ability remains fragile and highly sensitive to framing. These findings provide the first large-scale evidence on the extent to which LLMs can act as autonomous scientific problem-solvers, highlighting both their latent potential and their current limitations.
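The abstract describes two mechanisms: a propose-critique-revise loop run by solver agents, and an LLM-as-a-judge evaluation against three metrics (Success Rate, Rediscovery, Novelty). The sketch below is a minimal, hypothetical illustration of that pipeline, not the paper's actual implementation: the function names (`solve`, `judge`), the prompts, and the `ACCEPT`/YES parsing conventions are all assumptions for illustration.

```python
# Hypothetical sketch of the AInstein-style pipeline from the abstract.
# The prompts, agent roles, and acceptance signals are illustrative only.

from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out model client


def solve(llm: LLM, problem: str, max_rounds: int = 3) -> str:
    """Propose a solution, then alternate critique and revision rounds,
    mirroring the proposal -> review -> revision cycle of scientific inquiry."""
    solution = llm(f"Propose a technical solution to this research problem:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(
            "Critique the following solution. Reply 'ACCEPT' if it adequately "
            f"addresses the problem; otherwise list concrete weaknesses:\n{solution}"
        )
        if critique.strip().upper().startswith("ACCEPT"):
            break  # the critic is satisfied; stop refining
        solution = llm(
            "Revise the solution to address the critique.\n"
            f"Solution:\n{solution}\n\nCritique:\n{critique}"
        )
    return solution


def judge(llm: LLM, problem: str, solution: str, human_method: str) -> dict:
    """Score a solution with an LLM judge on the abstract's three metrics;
    the rubric wording and YES/NO format here are assumed, not the paper's."""
    verdict = llm(
        "Using the rubric, answer YES/NO for each item:\n"
        "SUCCESS: does the solution address the problem?\n"
        "REDISCOVERY: does it align with the human-proposed method?\n"
        "NOVELTY: is it a valid, original alternative?\n"
        f"Problem:\n{problem}\nSolution:\n{solution}\nHuman method:\n{human_method}"
    )
    return {m: f"{m}: YES" in verdict.upper() for m in ("SUCCESS", "REDISCOVERY", "NOVELTY")}
```

Passing the model client as a plain `Callable` keeps the sketch independent of any particular LLM API, consistent with the paper's constraint that solvers rely only on pretrained parametric knowledge, with no retrieval augmentation or fine-tuning.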