Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Abstract

The emergence of Large Language Models (LLMs) has highlighted the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, and so fail to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyperactivity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.
