LiveTradeBench: Seeking Real-World Alpha with Large Language Models
November 5, 2025
Authors: Haofei Yu, Fenghai Li, Jiaxuan You
cs.AI
Abstract
Large language models (LLMs) achieve strong performance across
benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but
these tests occur in static settings, lacking real dynamics and uncertainty.
Consequently, they evaluate isolated reasoning or problem-solving rather than
decision-making under uncertainty. To address this, we introduce
LiveTradeBench, a live trading environment for evaluating LLM agents in
realistic and evolving markets. LiveTradeBench follows three design principles:
(i) Live data streaming of market prices and news, eliminating dependence on
offline backtesting and preventing information leakage while capturing
real-time uncertainty; (ii) a portfolio-management abstraction that extends
control from single-asset actions to multi-asset allocation, integrating risk
management and cross-asset reasoning; and (iii) multi-market evaluation across
structurally distinct environments--U.S. stocks and Polymarket prediction
markets--differing in volatility, liquidity, and information flow. At each
step, an agent observes prices, news, and its portfolio, then outputs
percentage allocations that balance risk and return. Using LiveTradeBench, we
run 50-day live evaluations of 21 LLMs across families. Results show that (1)
high LMArena scores do not imply superior trading outcomes; (2) models display
distinct portfolio styles reflecting risk appetite and reasoning dynamics; and
(3) some LLMs effectively leverage live signals to adapt decisions. These
findings expose a gap between static evaluation and real-world competence,
motivating benchmarks that test sequential decision making and consistency
under live uncertainty.
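To make the per-step interface described above concrete, here is a minimal sketch of the observe-then-allocate loop: the agent receives prices, news, and its current portfolio, and returns percentage allocations that sum to 100. This is an illustrative assumption, not the benchmark's actual API; the class and function names (Observation, allocate) are hypothetical, and the toy equal-weight policy merely stands in for an LLM's decision.

```python
# Hypothetical sketch of a LiveTradeBench-style decision step (not the authors' code).
from dataclasses import dataclass


@dataclass
class Observation:
    prices: dict[str, float]      # latest price per asset ticker
    news: list[str]               # recent headlines streamed at this step
    portfolio: dict[str, float]   # current percentage allocation per asset (incl. cash)


def allocate(obs: Observation) -> dict[str, float]:
    """Toy allocation policy standing in for an LLM's decision.

    A real agent would prompt an LLM with the observation and parse its
    response; here we simply rebalance equally across assets plus cash.
    """
    assets = list(obs.prices) + ["CASH"]
    weight = 100.0 / len(assets)
    return {asset: weight for asset in assets}


if __name__ == "__main__":
    obs = Observation(
        prices={"AAPL": 227.5, "NVDA": 139.9},
        news=["Fed holds rates steady"],
        portfolio={"AAPL": 50.0, "NVDA": 30.0, "CASH": 20.0},
    )
    allocation = allocate(obs)
    # Percentage allocations must sum to 100, as in the abstract's action space.
    assert abs(sum(allocation.values()) - 100.0) < 1e-9
    print(allocation)
```

In this framing, the only contract the environment imposes is the shape of the action (non-negative percentages summing to 100 across assets and cash); everything about how the model weighs live prices and news against risk is left to the agent.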