Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
August 13, 2025
Authors: Marco De Nadai, Andreas Damianou, Mounia Lalmas
cs.AI
Abstract
Existing video recommender systems rely primarily on user-defined metadata or
on low-level visual and acoustic signals extracted by specialised encoders.
These low-level features describe what appears on the screen but miss deeper
semantics such as intent, humour, and world knowledge that make clips resonate
with viewers. For example, is a 30-second clip simply a singer on a rooftop, or
an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such
distinctions are critical to personalised recommendations yet remain invisible
to traditional encoding pipelines. In this paper, we introduce a simple,
recommendation system-agnostic zero-finetuning framework that injects
high-level semantics into the recommendation pipeline by prompting an
off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip
into a rich natural-language description (e.g. "a superhero parody with
slapstick fights and orchestral stabs"), bridging the gap between raw content
and user intent. We encode the MLLM output with a state-of-the-art text encoder
and feed the resulting embeddings into standard collaborative, content-based, and generative
recommenders. On the MicroLens-100K dataset, which emulates user interactions
with TikTok-style videos, our framework consistently surpasses conventional
video, audio, and metadata features across five representative models. Our findings
highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to
build more intent-aware video recommenders.
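
To make the described pipeline concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `describe_clip` stub stands in for prompting an off-the-shelf MLLM on each clip, the "all-MiniLM-L6-v2" sentence-transformer is just one example of a strong text encoder, and the cosine-similarity ranking below is a simple content-based stand-in for the collaborative, content-based, and generative recommenders evaluated in the paper.

```python
# Minimal sketch of the MLLM-description pipeline (illustrative assumptions only).
import numpy as np
from sentence_transformers import SentenceTransformer

def describe_clip(video_path: str) -> str:
    """Placeholder for the MLLM step: prompt a multimodal LLM to summarise
    the clip into a rich natural-language description."""
    # Example output from the abstract:
    return "a superhero parody with slapstick fights and orchestral stabs"

# 1) Describe every clip with the MLLM, then encode the descriptions as text.
clips = ["clip_001.mp4", "clip_002.mp4", "clip_003.mp4"]
descriptions = [describe_clip(c) for c in clips]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any strong text encoder
item_vecs = encoder.encode(descriptions, normalize_embeddings=True)

# 2) Build a simple user profile by averaging embeddings of watched clips.
watched_idx = [0, 2]
user_vec = item_vecs[watched_idx].mean(axis=0)
user_vec /= np.linalg.norm(user_vec)

# 3) Rank unseen clips by cosine similarity to the user profile.
scores = item_vecs @ user_vec
ranking = np.argsort(-scores)
print([clips[i] for i in ranking if i not in watched_idx])
```

In the paper's setting, these text embeddings would replace the conventional video, audio, and metadata features fed to each recommender, rather than being used directly for nearest-neighbour ranking as in this toy example.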