FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

Abstract

Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions to emerging trends, just as human experts do in fields such as politics, economics, and finance. Despite the importance of this task, no large-scale benchmark exists for evaluating agents on future prediction, largely because of the challenges of handling real-time updates and retrieving timely, accurate answers. To address this, we introduce FutureX, a dynamic, live evaluation benchmark designed specifically for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction; it supports real-time daily updates and eliminates data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including models with reasoning and search capabilities and models that integrate external tools, such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. We also provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including vulnerability to fake web pages and temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.
