Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models
January 12, 2026
Authors: Yuanyang Yin, Yufan Deng, Shenghai Yuan, Kaipeng Zhang, Xiao Yang, Feng Zhao
cs.AI
Abstract
Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints with low-frequency textual guidance during denoising. However, while existing I2V models prioritize visual consistency, how to couple this dual guidance effectively so that outputs adhere closely to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon we call Condition Isolation, in which attention to visual features becomes partially detached from text guidance and relies too heavily on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability of Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) uses CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers; (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and reducing their over-reliance on the model's learned visual priors, thereby improving adherence to textual instructions. To validate our approach and address the lack of evaluation in this direction, we also introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves effective and generalizable, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and on the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
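As an illustration of the diagnostic idea in the abstract (a drop in text-visual similarity at certain layers), the following is a minimal Python sketch, not the authors' procedure. It assumes one pooled visual feature vector and one pooled text feature vector per layer, uses cosine similarity as the similarity measure, and flags a layer as weak when its similarity falls below a fraction of the median across layers. The function names, the pooling assumption, and the `ratio` threshold are all hypothetical; the abstract does not specify how the similarity is computed or thresholded.

```python
import math
from statistics import median


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def layer_similarities(visual_by_layer, text_by_layer):
    """Per-layer text-visual similarity.

    Assumption: each entry is a single pooled feature vector for that layer.
    """
    return [cosine_similarity(v, t) for v, t in zip(visual_by_layer, text_by_layer)]


def find_semantic_weak_layers(similarities, ratio=0.5):
    """Indices of layers whose similarity is below ratio * median similarity.

    The relative-to-median rule and the default ratio are illustrative choices,
    and they presume the median similarity is positive.
    """
    cutoff = ratio * median(similarities)
    return [i for i, s in enumerate(similarities) if s < cutoff]


# Toy features for four layers; the third layer's visual feature is
# orthogonal to the text feature, so its similarity is zero.
toy_text = [[1.0, 0.0]] * 4
toy_visual = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.1]]
toy_sims = layer_similarities(toy_visual, toy_text)
weak_layers = find_semantic_weak_layers(toy_sims)
```

In this toy setup only the third layer (index 2) is flagged. In a real DiT-based model the per-layer features would have to be extracted from the network, for example with forward hooks, which is outside the scope of this sketch.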