ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgement

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
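The temporal control described above (pre-cutoff papers available during generation, post-cutoff papers held back for validation) can be illustrated with a minimal sketch. This is not ForeSci's actual implementation: the `Paper` record, the `split_by_cutoff` helper, and the choice to treat a paper published exactly on the cutoff date as visible are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical paper record; field names are illustrative only."""
    paper_id: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Partition papers into a generation-time pool and a validation-only pool.

    Assumption: papers published on or before the cutoff are visible to the
    agent; anything strictly later is hidden and reserved for validation.
    """
    visible = [p for p in papers if p.published <= cutoff]
    held_out = [p for p in papers if p.published > cutoff]
    return visible, held_out


# Example usage with made-up records.
corpus = [
    Paper("a", date(2023, 3, 1)),
    Paper("b", date(2023, 9, 15)),
    Paper("c", date(2024, 2, 10)),
]
knowledge_base, validation_set = split_by_cutoff(corpus, date(2023, 12, 31))
```

Under this split the agent's retrieval index would be built only from `knowledge_base`, while `validation_set` would be consulted solely when scoring a forecast.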