VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
November 4, 2025
Authors: Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
cs.AI
Abstract
Understanding and predicting emotion from videos has attracted significant
attention in recent studies, driven by advances in video large language
models (VideoLLMs). While advanced methods have made progress in video emotion
analysis, the intrinsic nature of emotions poses significant challenges.
Emotions are characterized by dynamic, cue-dependent properties, making it
difficult to understand complex and evolving emotional states with a sound
rationale. To tackle these challenges, we propose a novel affective cues-guided
reasoning framework that unifies fundamental attribute perception, expression
analysis, and high-level emotional understanding in a stage-wise manner. At the
core of our approach is a family of video emotion foundation models (VidEmo),
specifically designed for emotion reasoning and instruction-following. These
models undergo a two-stage tuning process: first, curriculum emotion learning
for injecting emotion knowledge, followed by affective-tree reinforcement
learning for emotion reasoning. Moreover, we establish a foundational data
infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG)
consisting of 2.1M diverse instruction-based samples. Emo-CFG includes
explainable emotional question-answering, fine-grained captions, and associated
rationales, providing essential resources for advancing emotion understanding
tasks. Experimental results demonstrate that our approach achieves competitive
performance, setting a new milestone across 15 face perception tasks.
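The affective-tree reinforcement learning stage described above rewards the reasoning path, not only the final prediction. As a rough illustration, the Python sketch below scores a staged reasoning trace (attribute perception, expression analysis, emotion understanding) with partial credit at each cue node; the node names, weights, and exact-match rule are assumptions for illustration, not the paper's actual reward design.

```python
# Minimal, self-contained sketch of a tree-structured ("affective-tree")
# reward that gives partial credit to intermediate cue nodes rather than
# scoring only the final answer. Node names, weights, and the exact-match
# rule are illustrative assumptions, not the paper's actual reward.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TraceNode:
    stage: str        # "attribute_perception", "expression_analysis",
                      # or "emotion_understanding" (the abstract's three stages)
    prediction: str   # the model's claim at this node, e.g. "smiling", "joy"

@dataclass
class Reference:
    labels: Dict[str, str]  # gold label per stage
    answer: str             # gold final emotion

def affective_tree_reward(trace: List[TraceNode], ref: Reference,
                          cue_weight: float = 0.2,
                          answer_weight: float = 0.4) -> float:
    """Weighted sum of per-stage cue correctness plus final-answer correctness."""
    reward = sum(cue_weight for node in trace
                 if node.prediction == ref.labels.get(node.stage))
    final = next((n for n in trace if n.stage == "emotion_understanding"), None)
    if final is not None and final.prediction == ref.answer:
        reward += answer_weight
    return reward

# A trace that gets the low-level cues right but the final emotion wrong
# still earns partial credit, which is the point of scoring the tree
# instead of only the answer.
trace = [TraceNode("attribute_perception", "young adult"),
         TraceNode("expression_analysis", "smiling"),
         TraceNode("emotion_understanding", "surprise")]
ref = Reference(labels={"attribute_perception": "young adult",
                        "expression_analysis": "smiling",
                        "emotion_understanding": "joy"},
                answer="joy")
print(affective_tree_reward(trace, ref))  # 0.4 (two correct cue nodes)
```

Shaping the reward over intermediate cue nodes pushes the policy toward reasoning traces whose low-level evidence actually supports the predicted emotion, rather than rewarding a lucky final answer.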
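For a sense of what one of the 2.1M instruction samples might look like, here is a hypothetical Emo-CFG-style record covering the three components the abstract names (question-answering, fine-grained caption material, and a rationale); the field names and values are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical Emo-CFG-style instruction sample; field names and values
# are illustrative assumptions, not the dataset's released schema.
sample = {
    "video": "clips/000123.mp4",
    "task": "explainable_emotion_qa",  # other tasks: fine-grained captioning, etc.
    "instruction": "What emotion does the person convey, and why?",
    "answer": "Joy",
    "rationale": (
        "Attributes: young adult, facing the camera. "
        "Expression: raised cheeks, broad smile. "
        "Context: receiving a gift. Hence the conveyed emotion is joy."
    ),
}
```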