

FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

June 5, 2025
Authors: Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
cs.AI

Abstract

Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
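To make the first contribution concrete, the sketch below illustrates what a sequential spatial-temporal-channel attention block over video tokens can look like. It is a minimal PyTorch illustration of the general factorization idea, not the authors' implementation: the module names, tensor layout, use of standard softmax multi-head attention, and the pooled-gating stand-in for global channel attention are all assumptions; FEAT's weighted key-value attention and residual value guidance module are not reproduced here (see the official repository for the actual code).

```python
# Minimal sketch (assumptions, not the FEAT code): attention applied
# sequentially along the spatial, temporal, and channel axes of a
# video feature tensor of shape (batch, frames, spatial tokens, channels).
import torch
import torch.nn as nn


class FactorizedVideoAttention(nn.Module):
    """Sequential spatial -> temporal -> channel interaction over video tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Stand-in for global channel attention: a learned projection of
        # globally pooled features used as a channel gate.
        self.channel_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, spatial tokens, channels
        b, t, n, c = x.shape

        # Spatial attention: tokens within each frame attend to each other.
        xs = x.reshape(b * t, n, c)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        x = xs.reshape(b, t, n, c)

        # Temporal attention: each spatial location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(b, n, t, c).permute(0, 2, 1, 3)

        # Channel interaction: gate channels with statistics pooled over all
        # frames and spatial positions (a linear-complexity approximation).
        gate = torch.softmax(self.channel_proj(x.mean(dim=(1, 2))), dim=-1)  # (B, C)
        return x * gate[:, None, None, :]


if __name__ == "__main__":
    block = FactorizedVideoAttention(dim=64)
    video_tokens = torch.randn(2, 8, 16, 64)  # 2 clips, 8 frames, 16 tokens, 64 channels
    print(block(video_tokens).shape)  # torch.Size([2, 8, 16, 64])
```

Note that, unlike the quadratic softmax attention used in this sketch, the paper reports linear-complexity attention in every dimension; the sketch only conveys the ordering of the three attention stages.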