VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

December 2, 2024
Authors: Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman
cs.AI

Abstract

Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint prediction transformer models often overlook the cross-task dynamics between the two tasks, as well as video-text alignment and refinement. Moreover, most models typically use limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal performance in capturing the interdependence between video and text modalities. Although large language models and vision-language models (LLMs/LVLMs) have gained prominence across various domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware clip representations, and (iii) a uni-directional joint-task feedback mechanism enhancing both tasks through correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) leverage LVLMs like BLIP-2 for enhanced multimodal feature integration and intelligent pretraining using synthetic data generated from LVLMs. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Codes and models are available at https://github.com/dpaul06/VideoLights.
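To make component (i) concrete, below is a minimal sketch of what a video-text alignment loss of the kind the abstract mentions could look like, written here as a symmetric InfoNCE-style contrastive objective over pooled clip and text embeddings. The mean pooling, the temperature value, and the symmetric form are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch of a video-text alignment loss (InfoNCE-style); the pooling,
# temperature, and symmetric cross-entropy are assumptions for illustration.
import torch
import torch.nn.functional as F

def alignment_loss(video_feats: torch.Tensor, text_feats: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    # video_feats: (B, num_clips, D), text_feats: (B, num_tokens, D)
    v = F.normalize(video_feats.mean(dim=1), dim=-1)   # (B, D) pooled clip embedding
    t = F.normalize(text_feats.mean(dim=1), dim=-1)    # (B, D) pooled text embedding
    sims = v @ t.T / temperature                        # (B, B) similarity matrix
    labels = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
    # pull matched video-text pairs together, push mismatched pairs apart
    return 0.5 * (F.cross_entropy(sims, labels) + F.cross_entropy(sims.T, labels))
```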
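Component (ii), bi-directional cross-modal fusion, pairs video-to-text attention with text-to-video attention so each modality conditions the other. The PyTorch sketch below illustrates that idea under stated assumptions: the module name, dimensions, residual layer norms, and the way the two streams are fused are all hypothetical choices, not the paper's implementation.

```python
# Hedged sketch of bi-directional cross-modal fusion using standard
# multi-head attention; names and fusion details are illustrative assumptions.
import torch
import torch.nn as nn

class BiDirectionalCrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        # video-to-text attention: clips query the text tokens
        self.v2t_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # text-to-video attention: text tokens query the clips
        self.t2v_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_t = nn.LayerNorm(d_model)
        # project the concatenated streams back to d_model
        self.out_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, num_clips, d_model), text: (B, num_tokens, d_model)
        v_ctx, _ = self.v2t_attn(query=video, key=text, value=text)
        v_ctx = self.norm_v(video + v_ctx)   # query-aware clip features
        t_ctx, _ = self.t2v_attn(query=text, key=video, value=video)
        t_ctx = self.norm_t(text + t_ctx)    # video-aware text features
        # pool the refined text and broadcast it over clips before fusing
        t_pooled = t_ctx.mean(dim=1, keepdim=True).expand_as(v_ctx)
        return self.out_proj(torch.cat([v_ctx, t_pooled], dim=-1))

# usage:
# fusion = BiDirectionalCrossModalFusion()
# clips = torch.randn(2, 75, 256)    # e.g. 75 two-second clips per video
# words = torch.randn(2, 20, 256)
# fused = fusion(clips, words)       # (2, 75, 256) query-aware clip reps
```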
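Component (iv) describes hard positive/negative losses with adaptive error penalization. One plausible reading, sketched below, is a focal-style weighting in which confident mistakes (hard positives scored low, hard negatives scored high) receive larger gradients; the weighting scheme and the `gamma` parameter are assumptions for illustration, not the paper's exact loss.

```python
# Hedged sketch of a hard positive/negative loss with adaptive penalization;
# the focal-style weight and gamma are illustrative assumptions.
import torch
import torch.nn.functional as F

def hard_pos_neg_loss(logits: torch.Tensor, targets: torch.Tensor,
                      gamma: float = 2.0) -> torch.Tensor:
    """Saliency-style BCE where hard positives (low score, label 1) and
    hard negatives (high score, label 0) are up-weighted."""
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the probability assigned to the true class; small p_t = hard example
    p_t = probs * targets + (1.0 - probs) * (1.0 - targets)
    weight = (1.0 - p_t) ** gamma   # adaptive penalty on confident errors
    return (weight * bce).mean()

# usage:
# logits = torch.randn(2, 75)                  # per-clip saliency scores
# labels = torch.randint(0, 2, (2, 75)).float()
# loss = hard_pos_neg_loss(logits, labels)
```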
