Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
August 13, 2025
Authors: Marco De Nadai, Andreas Damianou, Mounia Lalmas
cs.AI
Abstract
Existing video recommender systems rely primarily on user-defined metadata or
on low-level visual and acoustic signals extracted by specialised encoders.
These low-level features describe what appears on the screen but miss deeper
semantics such as intent, humour, and world knowledge that make clips resonate
with viewers. For example, is a 30-second clip simply a singer on a rooftop, or
an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such
distinctions are critical to personalised recommendations yet remain invisible
to traditional encoding pipelines. In this paper, we introduce a simple,
recommender-system-agnostic, zero-finetuning framework that injects
high-level semantics into the recommendation pipeline by prompting an
off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip
into a rich natural-language description (e.g. "a superhero parody with
slapstick fights and orchestral stabs"), bridging the gap between raw content
and user intent. We encode the MLLM output with a state-of-the-art text encoder and
feed it into standard collaborative, content-based, and generative
recommenders. On the MicroLens-100K dataset, which emulates user interactions
with TikTok-style videos, our framework consistently surpasses conventional
video, audio, and metadata features across five representative models. Our findings
highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to
build more intent-aware video recommenders.
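
The sketch below illustrates the described pipeline under stated assumptions: a hypothetical `call_mllm` wrapper stands in for whichever off-the-shelf MLLM is prompted, the sentence-transformers library with the `all-MiniLM-L6-v2` model stands in for the text encoder, and a simple cosine-similarity scorer stands in for the downstream recommender. The prompt wording, model choices, and recommender are illustrative, not the paper's exact setup.

```python
# Minimal sketch of the MLLM-description pipeline (assumptions noted below).
import numpy as np
from sentence_transformers import SentenceTransformer

PROMPT = (
    "Watch this short video and describe it in one rich sentence, "
    "covering the action, the humour or intent, and any notable "
    "places, people, or cultural references."
)

def call_mllm(video_path: str, prompt: str) -> str:
    """Hypothetical wrapper: send the clip's frames and audio to an
    off-the-shelf multimodal LLM and return its text description."""
    raise NotImplementedError("plug in your MLLM provider here")

def describe_clips(video_paths):
    # Step 1: the MLLM turns each clip into a high-level description,
    # e.g. "a superhero parody with slapstick fights and orchestral stabs".
    return [call_mllm(path, PROMPT) for path in video_paths]

def embed_descriptions(descriptions):
    # Step 2: a text encoder maps descriptions to vectors that any
    # recommender can consume as item features (encoder choice assumed).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    return encoder.encode(descriptions, normalize_embeddings=True)

def recommend(user_history_idx, item_embeddings, k=10):
    # Step 3 (illustrative content-based recommender): score items by
    # cosine similarity to the mean embedding of the user's watched clips.
    profile = item_embeddings[user_history_idx].mean(axis=0)
    profile /= np.linalg.norm(profile) + 1e-12
    scores = item_embeddings @ profile
    scores[user_history_idx] = -np.inf  # do not re-recommend watched clips
    return np.argsort(-scores)[:k]
```

In the paper's setting the resulting description embeddings replace conventional video, audio, and metadata features as item representations for collaborative, content-based, and generative recommenders; the cosine scorer above is only the simplest such consumer.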