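Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces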

Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a "positive average person", exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
