Sci-Reasoning: A Dataset Decoding AI Innovation Patterns

Abstract

While AI innovation is accelerating rapidly, the intellectual process behind breakthroughs (how researchers identify gaps, synthesize prior work, and generate insights) remains poorly understood. The lack of structured data on scientific reasoning hinders the systematic analysis and development of AI research agents. We introduce Sci-Reasoning, the first dataset capturing the intellectual synthesis behind high-quality AI research. Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, we trace Oral and Spotlight papers across NeurIPS, ICML, and ICLR (2023-2025) to their key predecessors, articulating the specific reasoning links in a structured format. Our analysis identifies 15 distinct thinking patterns, with three dominant strategies together accounting for 52.7%: Gap-Driven Reframing (24.2%), Cross-Domain Synthesis (18.0%), and Representation Shift (10.5%). The most powerful innovation recipes combine multiple patterns: Gap-Driven Reframing + Representation Shift, Cross-Domain Synthesis + Representation Shift, and Gap-Driven Reframing + Cross-Domain Synthesis. This dataset enables quantitative studies of scientific progress and provides structured reasoning trajectories for training the next generation of AI research agents.
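
To make the idea of "reasoning links in a structured format" and the pattern statistics more concrete, the following is a minimal sketch of how such records might be represented and tallied. The abstract does not specify the dataset's schema, so every class name, field name, and the toy records below are hypothetical and purely illustrative; only the three pattern labels are taken from the abstract.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations
from typing import Dict, List, Tuple


@dataclass
class ReasoningLink:
    """Hypothetical link from a target paper to one key predecessor."""
    predecessor_title: str
    relation: str  # free-text note on how the predecessor was built upon


@dataclass
class PaperRecord:
    """Hypothetical record; field names are NOT the dataset's actual schema."""
    title: str
    venue: str
    year: int
    patterns: List[str]  # thinking patterns, primary pattern listed first
    links: List[ReasoningLink] = field(default_factory=list)


def primary_pattern_shares(records: List[PaperRecord]) -> Dict[str, float]:
    """Fraction of records whose first-listed pattern is each label."""
    counts = Counter(r.patterns[0] for r in records if r.patterns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def pattern_pair_counts(records: List[PaperRecord]) -> Counter:
    """Count how often two distinct patterns are assigned to the same paper."""
    pairs: Counter = Counter()
    for r in records:
        # Sort so each unordered pair maps to a single key.
        for pair in combinations(sorted(set(r.patterns)), 2):
            pairs[pair] += 1
    return pairs


# Toy records invented for illustration; they are not drawn from the dataset.
records = [
    PaperRecord("Paper A", "NeurIPS", 2023,
                ["Gap-Driven Reframing", "Representation Shift"],
                [ReasoningLink("Prior work X", "exposes an unaddressed gap")]),
    PaperRecord("Paper B", "ICML", 2024,
                ["Cross-Domain Synthesis", "Representation Shift"]),
    PaperRecord("Paper C", "ICLR", 2025,
                ["Gap-Driven Reframing", "Cross-Domain Synthesis"]),
    PaperRecord("Paper D", "ICLR", 2024, ["Gap-Driven Reframing"]),
]

shares = primary_pattern_shares(records)
pairs = pattern_pair_counts(records)
```

On the real data, a tally of this kind would correspond to the reported shares, where the three dominant strategies sum as 24.2% + 18.0% + 10.5% = 52.7%, and the pair counts would correspond to the multi-pattern combinations the abstract highlights.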