ForeSci: Evaluating LLM Agents on Forward-Looking AI Research Judgement

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
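The cutoff-aligned setup described above can be illustrated with a minimal sketch. It assumes each paper carries a publication date and that the cutoff is inclusive; the `Paper` record, the `split_by_cutoff` helper, and the sample dates are hypothetical names and values chosen for illustration, not part of ForeSci itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical paper record; only the fields needed for the split."""
    paper_id: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Partition papers into a generation-time knowledge base and a held-out pool.

    Assumption: papers published on or before the cutoff are visible to the
    agent; anything later is hidden and reserved for validation.
    """
    knowledge_base = [p for p in papers if p.published <= cutoff]
    validation_pool = [p for p in papers if p.published > cutoff]
    return knowledge_base, validation_pool


# Illustrative corpus with made-up identifiers and dates.
corpus = [
    Paper("p1", date(2023, 3, 1)),
    Paper("p2", date(2023, 11, 15)),
    Paper("p3", date(2024, 2, 10)),
]

kb, held_out = split_by_cutoff(corpus, cutoff=date(2023, 12, 31))
print([p.paper_id for p in kb])        # papers the agent may retrieve
print([p.paper_id for p in held_out])  # papers used only for validation
```

Keeping the two pools disjoint is what lets post-cutoff papers serve as validation without leaking into retrieval; a real pipeline would also need to decide how to treat preprints that were later revised after the cutoff.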