StockBench: Can LLM Agents Trade Stocks Profitably in Real-World Markets?

Abstract

Large language models (LLMs) have recently demonstrated strong capabilities as autonomous agents, showing promise in reasoning, tool use, and sequential decision-making. While prior benchmarks have evaluated LLM agents in domains such as software engineering and scientific discovery, the finance domain remains underexplored, despite its direct relevance to economic value and high-stakes decision-making. Existing financial benchmarks primarily test static knowledge through question answering, but they fall short of capturing the dynamic and iterative nature of trading. To address this gap, we introduce StockBench, a contamination-free benchmark designed to evaluate LLM agents in realistic, multi-month stock trading environments. Agents receive daily market signals -- including prices, fundamentals, and news -- and must make sequential buy, sell, or hold decisions. Performance is assessed using financial metrics such as cumulative return, maximum drawdown, and the Sortino ratio. Our evaluation of state-of-the-art proprietary (e.g., GPT-5, Claude-4) and open-weight (e.g., Qwen3, Kimi-K2, GLM-4.5) models shows that while most LLM agents struggle to outperform the simple buy-and-hold baseline, several models demonstrate the potential to deliver higher returns and manage risk more effectively. These findings highlight both the challenges and opportunities in developing LLM-powered financial agents, showing that excelling at static financial knowledge tasks does not necessarily translate into successful trading strategies. We release StockBench as an open-source resource to support reproducibility and advance future research in this domain.
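The three metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard textbook definitions applied to a series of daily portfolio values; the abstract does not give StockBench's exact formulas (for example, annualization or the Sortino target return), so the function names, the zero target return, and the lack of annualization are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cumulative_return(values: Sequence[float]) -> float:
    """Total return over the period: final value / initial value - 1."""
    return values[-1] / values[0] - 1.0


def max_drawdown(values: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a negative fraction (0.0 if none)."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)                 # highest value seen so far
        worst = min(worst, v / peak - 1.0)  # decline relative to that peak
    return worst


def sortino_ratio(values: Sequence[float], target: float = 0.0) -> float:
    """Mean excess daily return over downside deviation (assumed: not annualized)."""
    returns = [b / a - 1.0 for a, b in zip(values, values[1:])]
    excess = [r - target for r in returns]
    # Downside deviation counts only returns that fall below the target.
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    if downside == 0.0:
        return math.inf if sum(excess) > 0 else 0.0
    return (sum(excess) / len(excess)) / downside


if __name__ == "__main__":
    # Hypothetical daily portfolio values, not data from the benchmark.
    portfolio = [100.0, 110.0, 99.0, 108.9]
    print(cumulative_return(portfolio))
    print(max_drawdown(portfolio))
    print(sortino_ratio(portfolio))
```

For the hypothetical series above, the cumulative return is about 8.9% and the maximum drawdown is about -10%. Unlike the Sharpe ratio, the Sortino ratio penalizes only downside volatility, so it suits a comparison against a buy-and-hold baseline whose upside swings should not count as risk.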