ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgement

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
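
The temporal control described above can be illustrated with a minimal sketch. It assumes each paper carries a publication date and that a task's knowledge base is built by filtering on a single cutoff date. The `Paper` class, the `split_by_cutoff` helper, the sample titles and dates, and the choice to treat the cutoff day itself as visible are all hypothetical and are not taken from the ForeSci implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record type; real benchmark entries likely carry more fields.
    title: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Split papers into (visible, held_out) around a task cutoff.

    Assumption: papers published on or before the cutoff are visible to the
    agent; later papers are hidden during generation and kept for validation.
    """
    visible = [p for p in papers if p.published <= cutoff]
    held_out = [p for p in papers if p.published > cutoff]
    return visible, held_out


# Illustrative data only.
papers = [
    Paper("Paper A", date(2023, 3, 1)),
    Paper("Paper B", date(2023, 11, 15)),
    Paper("Paper C", date(2024, 6, 30)),
]
cutoff = date(2023, 12, 31)

visible, held_out = split_by_cutoff(papers, cutoff)
print("knowledge base:", [p.title for p in visible])
print("validation only:", [p.title for p in held_out])
```

In this sketch the agent would retrieve only from `visible`, while `held_out` would be consulted solely when checking its forecast, mirroring the separation between generation and validation stated in the abstract.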