ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgement

Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
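
The temporal control described above can be illustrated with a minimal sketch. It is not the ForeSci implementation: the `Paper` record, its field names, and both helper functions are hypothetical, and treating the cutoff date itself as pre-cutoff is an assumption. The sketch only shows the idea of keeping post-cutoff papers out of the knowledge base used for generation, and of checking that a backbone's knowledge cutoff precedes the task cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    """Hypothetical paper record; field names are illustrative only."""
    paper_id: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Partition papers into a generation-time knowledge base and a
    held-out pool reserved for validation.

    Assumption: a paper published on the cutoff date counts as pre-cutoff.
    """
    knowledge_base = [p for p in papers if p.published <= cutoff]
    validation_pool = [p for p in papers if p.published > cutoff]
    return knowledge_base, validation_pool


def backbone_precedes_cutoff(backbone_knowledge_cutoff, task_cutoff):
    """Return True when the backbone's knowledge cutoff is strictly
    earlier than the task cutoff (assumed reading of 'precede')."""
    return backbone_knowledge_cutoff < task_cutoff
```

Under these assumptions, an agent would be given only `knowledge_base` while answering, and `validation_pool` would be consulted afterwards to check the forecast.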