

Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models

January 12, 2026
Authors: Yuanyang Yin, Yufan Deng, Shenghai Yuan, Kaipeng Zhang, Xiao Yang, Feng Zhao
cs.AI

Abstract

The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
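To make the Attention Cache idea concrete, here is a minimal PyTorch sketch of caching the cross-attention map from a semantically responsive layer and blending it into another layer's attention. The `cross_attention` helper, the `blend` coefficient, and the toy tensor shapes are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch


def cross_attention(q, k, v, attn_override=None, blend=0.5):
    # Standard scaled dot-product cross-attention (visual queries attend to
    # text keys/values). Optionally blends in a cached attention map from
    # another layer, illustrating the Attention Cache idea.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-1, -2) * scale, dim=-1)
    if attn_override is not None:
        # Inject the cached map from a semantically responsive layer into this
        # (assumed semantic-weak) layer; `blend` is a hypothetical hyperparameter.
        attn = (1.0 - blend) * attn + blend * attn_override
    return attn @ v, attn


# Toy shapes: batch=1, 16 visual tokens, 8 text tokens, dim=32.
q = torch.randn(1, 16, 32)  # visual queries
k = torch.randn(1, 8, 32)   # text keys
v = torch.randn(1, 8, 32)   # text values

# 1) Run a "semantically responsive" layer and cache its attention map.
_, cached_attn = cross_attention(q, k, v)

# 2) Re-use the cached map in a later, "semantic-weak" layer.
out, _ = cross_attention(q, k, v, attn_override=cached_attn, blend=0.5)
print(out.shape)  # torch.Size([1, 16, 32])
```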