AInstein: Assessing the Feasibility of AI-Generated Approaches to Research Problems

Abstract

Large language models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet it remains unclear whether such success reflects genuine reasoning or sophisticated recall. We introduce AInstein, a framework for testing whether LLMs can generate valid solutions to AI research problems using only their pretrained parametric knowledge, that is, without domain-specific fine-tuning, retrieval augmentation, or other external aids. Our approach extracts distilled problem statements from high-quality ICLR 2025 submissions, then tasks specialized solver agents with proposing and refining technical solutions through iterative critique loops, mimicking the cycles of proposal, review, and revision central to scientific inquiry. We evaluate AInstein on 1,214 ICLR papers stratified by acceptance tier (Oral, Spotlight, Poster), using an LLM-as-a-judge paradigm guided by a structured rubric and complemented by targeted manual checks. Performance is assessed with three metrics: Success Rate (does the solution address the problem?), Rediscovery (does it align with human-proposed methods?), and Novelty (does it yield valid, original approaches?). Our results reveal that while LLMs can rediscover feasible solutions and occasionally propose creative alternatives, their problem-solving ability remains fragile and highly sensitive to framing. These findings provide the first large-scale evidence on the extent to which LLMs can act as autonomous scientific problem-solvers, highlighting both their latent potential and their current limitations.
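The cycle of proposal, review, and revision described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the agents' interfaces, stopping rule, or round budget, so the names `refine`, `Propose`, `Critique`, and `max_rounds`, as well as the toy stand-in agents, are hypothetical and are not the AInstein implementation.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (not taken from the paper):
#   Propose:  (problem, previous_draft, feedback) -> new draft
#   Critique: (problem, draft) -> (accepted, feedback)
Propose = Callable[[str, Optional[str], Optional[str]], str]
Critique = Callable[[str, str], Tuple[bool, str]]


def refine(problem: str, propose: Propose, critique: Critique,
           max_rounds: int = 3) -> Tuple[str, int]:
    """Run a propose -> critique -> revise loop.

    Returns the final draft and the number of critique rounds that ran.
    If the budget runs out, the last revision is returned without a
    further critique.
    """
    draft = propose(problem, None, None)  # initial proposal
    for round_no in range(1, max_rounds + 1):
        accepted, feedback = critique(problem, draft)
        if accepted:
            return draft, round_no
        # Revise the draft using the critic's feedback.
        draft = propose(problem, draft, feedback)
    return draft, max_rounds


# Toy stand-ins for the solver and critic; a real system would call an LLM.
def toy_propose(problem: str, previous: Optional[str],
                feedback: Optional[str]) -> str:
    if previous is None:
        return f"v1 for {problem}"
    return previous + " +fix"


def toy_critique(problem: str, draft: str) -> Tuple[bool, str]:
    ok = draft.count("+fix") >= 2  # accept after two revisions
    return ok, ("" if ok else "needs more detail")
```

Calling `refine("P", toy_propose, toy_critique, max_rounds=5)` runs three critique rounds with these toy agents: two rejections followed by an acceptance. Passing the agents in as plain callables keeps the loop independent of any particular model or prompt format.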
