MIRAI: Evaluating LLM Agents for Event Forecasting
July 1, 2024
Authors: Chenchen Ye, Ziniu Hu, Yihe Deng, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang
cs.AI
Abstract
Recent advancements in Large Language Models (LLMs) have empowered LLM agents
to autonomously collect world information and reason over it to solve complex
problems. Given this capability, there is growing interest in employing LLM
agents to predict international events, which can influence decision-making
and shape policy development on an international scale. Despite this growing
interest, a rigorous benchmark of LLM agents' forecasting capability and
reliability is still lacking. To address this gap, we
introduce MIRAI, a novel benchmark designed to systematically evaluate LLM
agents as temporal forecasters in the context of international events. Our
benchmark features an agentic environment with tools for accessing an extensive
database of historical, structured events and textual news articles. We refine
the GDELT event database with careful cleaning and parsing to curate a series
of relational prediction tasks with varying forecasting horizons, assessing LLM
agents' abilities from short-term to long-term forecasting. We further
implement APIs to enable LLM agents to utilize different tools via a code-based
interface. In summary, MIRAI comprehensively evaluates the agents' capabilities
along three dimensions: 1) autonomously sourcing and integrating critical
information from large global databases; 2) writing code using domain-specific
APIs and libraries for tool use; and 3) jointly reasoning over historical
knowledge across diverse formats and time periods to accurately predict future
events. Through
comprehensive benchmarking, we aim to establish a reliable framework for
assessing the capabilities of LLM agents in forecasting international events,
thereby contributing to the development of more accurate and trustworthy models
for international relations analysis.
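
To make the notion of a forecasting horizon concrete, here is a minimal sketch of how a query date and a horizon could restrict which historical events an agent may consult. It is an illustration under stated assumptions, not MIRAI's actual interface: the `Event` record, the `visible_history` helper, the country codes, the relation labels, and the "on or before the cutoff" rule are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    """One structured event: (day, subject, relation, object). Hypothetical schema."""
    day: date
    subject: str
    relation: str
    obj: str


def visible_history(events, query_day, horizon_days):
    """Return events dated on or before query_day minus the horizon.

    Assumption: a horizon of h days means the forecaster sees nothing
    dated later than h days before the day being predicted.
    """
    cutoff = query_day - timedelta(days=horizon_days)
    return [e for e in events if e.day <= cutoff]


# Illustrative data only; codes and relation labels are placeholders.
sample_events = [
    Event(date(2023, 11, 1), "AUS", "consult", "CHN"),
    Event(date(2023, 11, 20), "AUS", "make statement", "CHN"),
    Event(date(2023, 11, 29), "CHN", "consult", "AUS"),
]
query_day = date(2023, 11, 30)

short_term = visible_history(sample_events, query_day, horizon_days=1)
mid_term = visible_history(sample_events, query_day, horizon_days=7)
long_term = visible_history(sample_events, query_day, horizon_days=30)
```

Under this reading, a longer horizon leaves the agent with less recent evidence, which is one way short-term and long-term forecasting could differ in difficulty.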