SVBench: Evaluation of Video Generation Models on Social Reasoning
December 25, 2025
Authors: Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang
cs.AI
Abstract
Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.
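The pipeline's final step, (iv), scores each generated video with a VLM-as-judge along five interpretable dimensions and aggregates per-dimension results across models. A minimal sketch of that judging and aggregation step is shown below; the dimension labels are hypothetical stand-ins (the abstract does not name the five dimensions), and the VLM judge is abstracted as an injected callable rather than any specific model API.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical dimension labels; the paper's five interpretable
# dimensions are not enumerated in the abstract.
DIMENSIONS = ["intention", "belief", "joint_attention", "coordination", "prosociality"]

@dataclass
class JudgeResult:
    video_id: str
    scores: dict  # dimension name -> score in [0, 1]

def judge_video(video_id, vlm_score):
    """Score one generated video on every social-reasoning dimension.

    `vlm_score` stands in for a call to a high-capacity VLM judge:
    any callable (video_id, dimension) -> float in [0, 1].
    """
    return JudgeResult(video_id, {d: vlm_score(video_id, d) for d in DIMENSIONS})

def aggregate(results):
    """Per-dimension mean over all judged videos of one generation model."""
    return {d: mean(r.scores[d] for r in results) for d in DIMENSIONS}
```

Keeping the judge as an injected callable makes the scoring loop trivially testable with a stub and lets the same aggregation code compare several video generation systems side by side.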