Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
August 13, 2025
Authors: Marco De Nadai, Andreas Damianou, Mounia Lalmas
cs.AI
Abstract
Existing video recommender systems rely primarily on user-defined metadata or
on low-level visual and acoustic signals extracted by specialised encoders.
These low-level features describe what appears on the screen but miss deeper
semantics such as intent, humour, and world knowledge that make clips resonate
with viewers. For example, is a 30-second clip simply a singer on a rooftop, or
an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such
distinctions are critical to personalised recommendations yet remain invisible
to traditional encoding pipelines. In this paper, we introduce a simple,
recommendation system-agnostic zero-finetuning framework that injects
high-level semantics into the recommendation pipeline by prompting an
off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip
into a rich natural-language description (e.g. "a superhero parody with
slapstick fights and orchestral stabs"), bridging the gap between raw content
and user intent. We encode the MLLM output with a state-of-the-art text encoder
and feed the resulting embeddings into standard collaborative, content-based, and generative
recommenders. On the MicroLens-100K dataset, which emulates user interactions
with TikTok-style videos, our framework consistently surpasses conventional
video, audio, and metadata features across five representative models. Our findings
highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to
build more intent-aware video recommenders.
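
To make the described pipeline concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `describe_clip` stub stands in for prompting an off-the-shelf MLLM on each clip, the "all-MiniLM-L6-v2" sentence-transformer is just one example of a strong text encoder, and the cosine-similarity ranking below is a simple content-based stand-in for the collaborative, content-based, and generative recommenders evaluated in the paper.

```python
# Minimal sketch of the MLLM-description pipeline (illustrative assumptions only).
import numpy as np
from sentence_transformers import SentenceTransformer

def describe_clip(video_path: str) -> str:
    """Placeholder for the MLLM step: prompt a multimodal LLM to summarise
    the clip into a rich natural-language description."""
    # Example output from the abstract:
    return "a superhero parody with slapstick fights and orchestral stabs"

# 1) Describe every clip with the MLLM, then encode the descriptions as text.
clips = ["clip_001.mp4", "clip_002.mp4", "clip_003.mp4"]
descriptions = [describe_clip(c) for c in clips]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any strong text encoder
item_vecs = encoder.encode(descriptions, normalize_embeddings=True)

# 2) Build a simple user profile by averaging embeddings of watched clips.
watched_idx = [0, 2]
user_vec = item_vecs[watched_idx].mean(axis=0)
user_vec /= np.linalg.norm(user_vec)

# 3) Rank unseen clips by cosine similarity to the user profile.
scores = item_vecs @ user_vec
ranking = np.argsort(-scores)
print([clips[i] for i in ranking if i not in watched_idx])
```

In the paper's setting, these text embeddings would replace the conventional video, audio, and metadata features fed to each recommender, rather than being used directly for nearest-neighbour ranking as in this toy example.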