Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Abstract

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, and so fail to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets built on isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to simulate these complex behaviors accurately, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a "positive average person", exhibiting hyper-activity, persona homogenization, and a Utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
