ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
June 4, 2026
Authors: Qiuyu Tian, Haojie Yin, Yingce Xia, Youyong Kong, Zequn Liu
cs.AI
Abstract
AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgments from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid reducing tasks to random prediction of future events, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to predate the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgment into a controlled benchmark for evaluating research agents as decision-making systems.
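The temporal control described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Paper` type, the function names, and the choice to treat papers dated exactly on the cutoff as pre-cutoff are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record type; the real benchmark schema is not given here.
    title: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Return (knowledge_base, validation_pool) for one task cutoff.

    Assumption: a paper dated exactly on the cutoff counts as pre-cutoff.
    """
    # Visible to the agent while it generates an answer.
    knowledge_base = [p for p in papers if p.published <= cutoff]
    # Hidden during generation, used only to validate the answer.
    validation_pool = [p for p in papers if p.published > cutoff]
    return knowledge_base, validation_pool


def backbone_is_eligible(model_cutoff, task_cutoff):
    # Assumption: "predate" is read as strictly earlier than the task cutoff.
    return model_cutoff < task_cutoff
```

Under these assumptions, every paper lands in exactly one of the two pools, so nothing published after the cutoff can reach the agent through the knowledge base.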