Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Abstract

The emergence of Large Language Models (LLMs) has raised the prospect of a general-purpose user simulator. However, existing benchmarks remain confined to isolated scenarios, narrow action spaces, or synthetic data, and so fail to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, which integrates long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Using this benchmark, we first provide empirical evidence that earlier datasets built on isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs show that current models struggle to simulate these complex behaviors accurately, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive "average person", exhibiting hyper-activity, persona homogenization, and a Utopian bias. The result is a loss of individual differences and long-tail behaviors, which points to critical directions for future research on high-fidelity simulation.
