MIRAI: Evaluating LLM Agents for Event Forecasting

July 1, 2024
Authors: Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang
cs.AI

Abstract

Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, there is growing interest in employing LLM agents to predict international events, which can influence decision-making and shape policy development on an international scale. Despite this interest, rigorous benchmarks of LLM agents' forecasting capability and reliability are lacking. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs that enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code with domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge in diverse formats and across time horizons to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.
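
To make the code-based tool interface concrete, the following is a minimal, hypothetical sketch of the kind of query an agent might compose against a MIRAI-style event database. The names used here (Event, query_events, the cameo_code field, and the in-memory stub data) are illustrative assumptions rather than the benchmark's actual API; a real environment would dispatch such calls to the cleaned, GDELT-derived database described above instead of a hard-coded list.

```python
# Hypothetical sketch of a code-based tool call an LLM agent might issue
# in a MIRAI-style environment. Function and field names are assumptions
# for illustration, not the benchmark's actual API.
from dataclasses import dataclass
from datetime import date
from collections import Counter


@dataclass
class Event:
    day: date        # event date
    actor1: str      # ISO country code of the first actor, e.g. "USA"
    actor2: str      # ISO country code of the second actor, e.g. "CHN"
    cameo_code: str  # CAMEO relation code, e.g. "042" (consult)


# Stub standing in for the database-backed tool; a real environment would
# query historical GDELT-derived records rather than this in-memory list.
_DB = [
    Event(date(2023, 11, 1), "USA", "CHN", "042"),
    Event(date(2023, 11, 3), "USA", "CHN", "036"),
    Event(date(2023, 11, 5), "USA", "CHN", "042"),
]


def query_events(actor1: str, actor2: str, start: date, end: date) -> list[Event]:
    """Return historical events between two actors within a date window."""
    return [
        e for e in _DB
        if e.actor1 == actor1 and e.actor2 == actor2 and start <= e.day <= end
    ]


# Agent-style usage: retrieve recent history between two actors, then form
# a naive forecast (the most frequent past relation) for the target date.
history = query_events("USA", "CHN", date(2023, 10, 1), date(2023, 11, 30))
forecast = Counter(e.cameo_code for e in history).most_common(1)[0][0]
print(f"Predicted relation code for 2023-12-01: {forecast}")
```

In the actual benchmark the agent would replace the naive frequency heuristic with its own reasoning over both structured events and retrieved news text; the sketch only illustrates how tool calls and prediction can be expressed through a code interface.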
