BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Abstract

The rapid advancement of text-to-video (T2V) models has transformed content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into videos generated from user prompts while preserving semantic fidelity to user intent. The task poses three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework with two coordinated phases. In the offline, advertiser-facing phase, we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online, user-facing phase, five agents iteratively refine the user prompt, drawing on the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models show that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.