Towards Retrieval Augmented Generation over Large Video Libraries

Abstract

Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from a large video library remains a challenge. In this paper, we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries and retrieve relevant video moments indexed by speech and visual metadata. An answer generation module then integrates the user query with this metadata to produce responses with specific video timestamps. This approach shows promise for multimedia content retrieval and AI-assisted video content creation.
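The three stages named above (query generation, retrieval over speech and visual metadata, and timestamped answer generation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the system proposed here: the `Moment` record, the function names, the keyword-based stand-in for the LLM query generator, the term-overlap retrieval score, and the template that stands in for the LLM answer module are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Moment:
    """Hypothetical index entry: one video moment with its metadata."""
    video_id: str
    start: float  # seconds from the beginning of the video
    end: float
    speech: str   # transcript text for the moment
    visual: str   # visual description of the moment


# Assumed stop-word list; a real system would prompt an LLM instead.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "are",
              "what", "where", "show", "me", "about", "and"}


def generate_queries(question: str) -> List[str]:
    """Stand-in for LLM query generation: keep non-stop-word keywords."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def retrieve(queries: List[str], index: List[Moment], k: int = 2) -> List[Moment]:
    """Score each moment by how many query terms occur in its metadata."""
    scored = []
    for moment in index:
        tokens = set(f"{moment.speech} {moment.visual}".lower().split())
        score = sum(1 for q in queries if q in tokens)
        if score > 0:
            scored.append((score, moment))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [moment for _, moment in scored[:k]]


def format_ts(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def answer(question: str, moments: List[Moment]) -> str:
    """Stand-in for the answer module: cite each moment with timestamps."""
    if not moments:
        return "No relevant moments found."
    lines = [
        f"- {m.video_id} [{format_ts(m.start)}-{format_ts(m.end)}]: {m.speech}"
        for m in moments
    ]
    return f"Moments relevant to '{question}':\n" + "\n".join(lines)


if __name__ == "__main__":
    library = [
        Moment("v1", 65, 80, "we whisk the eggs", "a bowl with eggs"),
        Moment("v2", 10, 20, "driving to the market", "a car on a road"),
    ]
    question = "Where are the eggs?"
    print(answer(question, retrieve(generate_queries(question), library)))
```

The stages are kept as separate functions so that either stand-in could be swapped for an LLM call, or the overlap score for a real search backend, without changing the others.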
