MIRAI: Evaluating LLM Agents for Event Forecasting

Abstract

Recent advances in Large Language Models (LLMs) have enabled LLM agents to autonomously collect information about the world and reason over it to solve complex problems. Given this capability, there is growing interest in using LLM agents to predict international events, which can influence decision-making and shape policy development on an international scale. Despite this interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database through careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs that enable LLM agents to use different tools via a code-based interface. In summary, MIRAI comprehensively evaluates agents' capabilities along three dimensions: 1) the ability to autonomously source and integrate critical information from large global databases; 2) the ability to write code that uses domain-specific APIs and libraries for tool use; and 3) the ability to jointly reason over historical knowledge in diverse formats and from different time periods to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.
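
To make the notion of a relational prediction task with a forecasting horizon concrete, the following is a minimal sketch and not MIRAI's actual API or data schema. It assumes events are stored as (date, subject, relation, object) records and that a horizon of h days hides the h days immediately before the query date from the forecaster. The names `Event`, `visible_history`, and `frequency_baseline`, the country codes, and the relation labels are all hypothetical, and the frequency count is only a simple stand-in for an agent's prediction.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    """Hypothetical structured event record (not the GDELT schema)."""
    day: date
    subject: str   # acting country (made-up code)
    relation: str  # event type (made-up label)
    obj: str       # target country (made-up code)


def visible_history(events, query_day, horizon_days):
    """Keep only events dated before query_day minus the horizon."""
    cutoff = query_day - timedelta(days=horizon_days)
    return [e for e in events if e.day < cutoff]


def frequency_baseline(events, subject, obj, query_day, horizon_days, k=1):
    """Predict the k most frequent past relations for the (subject, obj) pair."""
    history = visible_history(events, query_day, horizon_days)
    counts = Counter(
        e.relation for e in history if e.subject == subject and e.obj == obj
    )
    return [relation for relation, _ in counts.most_common(k)]


# Toy data, invented purely for illustration.
EVENTS = [
    Event(date(2023, 1, 1), "AAA", "consult", "BBB"),
    Event(date(2023, 1, 5), "AAA", "consult", "BBB"),
    Event(date(2023, 1, 20), "AAA", "threaten", "BBB"),
    Event(date(2023, 1, 25), "AAA", "threaten", "BBB"),
    Event(date(2023, 1, 28), "AAA", "threaten", "BBB"),
]

if __name__ == "__main__":
    query_day = date(2023, 2, 1)
    for horizon in (1, 14, 60):
        prediction = frequency_baseline(EVENTS, "AAA", "BBB", query_day, horizon)
        print(f"horizon={horizon:>2} days -> {prediction}")
```

Under these assumptions, a longer horizon leaves the forecaster with older and sparser evidence, which is why short-term and long-term forecasting can be evaluated as tasks of different difficulty.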