EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
November 14, 2025
Authors: Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang
cs.AI
Abstract
Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.
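The abstract notes that each video is annotated with visual attributes (brightness, colorfulness, hue). The paper excerpt does not specify the exact formulas, so the sketch below is a minimal per-frame version using common choices as assumptions: Rec. 601 luma for brightness, the Hasler–Süsstrunk opponent-axis metric for colorfulness, and a circular mean of HSV hue.

```python
import numpy as np

def frame_attributes(frame):
    """Per-frame visual attributes for an RGB frame (uint8, shape (H, W, 3)).

    Returns (brightness, colorfulness, hue_degrees). Formula choices are
    assumptions, not necessarily those used for the EmoVid annotations.
    """
    rgb = frame.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Brightness: mean Rec. 601 luma over all pixels.
    brightness = float(np.mean(0.299 * r + 0.587 * g + 0.114 * b))

    # Colorfulness: Hasler & Süsstrunk (2003) metric on opponent axes rg, yb.
    rg = r - g
    yb = 0.5 * (r + g) - b
    colorfulness = float(
        np.sqrt(rg.std() ** 2 + yb.std() ** 2)
        + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2)
    )

    # Hue: circular mean of per-pixel HSV hue, reported in degrees [0, 360).
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    delta = np.where(mx - mn == 0, 1.0, mx - mn)  # avoid /0; hue is 0 there
    h = np.where(mx == r, ((g - b) / delta) % 6,
        np.where(mx == g, (b - r) / delta + 2,
                 (r - g) / delta + 4)) * 60.0
    h_rad = np.radians(h)
    hue = float(np.degrees(np.arctan2(np.mean(np.sin(h_rad)),
                                      np.mean(np.cos(h_rad)))) % 360)
    return brightness, colorfulness, hue
```

For video-level annotation these statistics would be averaged over sampled frames; the circular mean for hue avoids the wrap-around artifact at 0°/360° that a plain arithmetic mean would introduce.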