FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Abstract

Synthesizing high-quality dynamic medical videos remains a significant challenge because it requires modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face three critical limitations: insufficient channel interaction, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a Full-dimensional Efficient Attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for the attention mechanism in each dimension, using weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks and show that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, demonstrating both effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
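To make the idea of sequential spatial, temporal, and channel attention with linear cost more concrete, the following is a minimal NumPy sketch. It is not the FEAT implementation. As stand-ins for the weighted key-value attention and global channel attention named in the abstract, it assumes a generic kernelized linear attention with an elu(x) + 1 feature map and a parameter-free sigmoid channel gate. The function names (`feature_map`, `linear_attention`, `channel_gate`, `full_dimensional_block`) and the (T, H, W, C) feature layout are hypothetical. Learned projections, normalization, and the residual value guidance module are omitted.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in kernelized linear attention.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    # q, k, v: (..., N, D). Cost grows linearly with the token count N because
    # the (D, D) key-value summary is built once and shared by every query.
    q, k = feature_map(q), feature_map(k)
    kv = np.einsum("...nd,...ne->...de", k, v)            # (..., D, D)
    z = k.sum(axis=-2)                                    # (..., D)
    num = np.einsum("...nd,...de->...ne", q, kv)          # (..., N, D)
    den = np.einsum("...nd,...d->...n", q, z)[..., None]  # (..., N, 1)
    return num / den


def channel_gate(x):
    # Assumed stand-in for global channel attention: pool over every token,
    # then rescale each channel with a sigmoid gate.
    pooled = x.mean(axis=(0, 1, 2))                       # (C,)
    gate = 1.0 / (1.0 + np.exp(-pooled))
    return x * gate


def full_dimensional_block(x):
    # x: video features of shape (T, H, W, C).
    T, H, W, C = x.shape

    # 1) Spatial attention: tokens are the H*W positions inside each frame.
    s = x.reshape(T, H * W, C)
    x = x + linear_attention(s, s, s).reshape(T, H, W, C)

    # 2) Temporal attention: tokens are the T frames at each spatial position.
    t = x.transpose(1, 2, 0, 3).reshape(H * W, T, C)
    t_out = linear_attention(t, t, t).reshape(H, W, T, C).transpose(2, 0, 1, 3)
    x = x + t_out

    # 3) Channel attention: global gating across the C channels.
    x = x + channel_gate(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.standard_normal((4, 8, 8, 16))   # 4 frames, 8x8 positions, 16 channels
    print(full_dimensional_block(video).shape)
```

Each stage here touches every token a constant number of times, so the total cost scales linearly with T * H * W, in contrast to the quadratic cost of standard self-attention over the same tokens.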
