
LiveTradeBench: Seeking Real-World Alpha with Large Language Models

November 5, 2025
Authors: Haofei Yu, Fenghai Li, Jiaxuan You
cs.AI

Abstract

Large language models (LLMs) achieve strong performance across benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments--U.S. stocks and Polymarket prediction markets--differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.
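The decision step described above (observe prices, news, and the current portfolio, then output percentage allocations) can be sketched as a minimal agent interface. This is an illustrative sketch, not the benchmark's actual API: the `Observation` fields, `normalize_allocation`, and `decide` are hypothetical names, and the policy shown simply holds the current portfolio where a real agent would query an LLM.

```python
# Sketch of one LiveTradeBench-style decision step (all names hypothetical).
from dataclasses import dataclass


@dataclass
class Observation:
    prices: dict[str, float]      # latest price per asset
    news: list[str]               # recent headlines streamed live
    portfolio: dict[str, float]   # current percentage allocation per asset


def normalize_allocation(raw: dict[str, float]) -> dict[str, float]:
    """Clip negative weights and rescale so allocations sum to 1 (100%)."""
    clipped = {k: max(v, 0.0) for k, v in raw.items()}
    total = sum(clipped.values())
    if total == 0.0:
        # Degenerate proposal: fall back to an equal-weight portfolio.
        n = len(clipped)
        return {k: 1.0 / n for k in clipped}
    return {k: v / total for k, v in clipped.items()}


def decide(obs: Observation) -> dict[str, float]:
    # Placeholder policy: keep the current allocation. A real agent would
    # prompt an LLM with obs.prices and obs.news to propose new percentages,
    # then normalize the proposal before execution.
    return normalize_allocation(obs.portfolio)
```

The normalization step reflects the constraint implied by "percentage allocations": whatever the model proposes, the executed portfolio must be a valid set of non-negative weights summing to 100%.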