FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Abstract

Synthesizing high-quality dynamic medical videos remains a significant challenge because it requires modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches have three critical limitations: insufficient channel interaction, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for the attention mechanism in each dimension, using weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks and show that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, demonstrating both effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.
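To make the idea of sequential spatial-temporal-channel attention with linear cost more concrete, the following is a minimal NumPy sketch. It is not the FEAT implementation: the abstract does not specify the weighted key-value attention or the global channel attention in detail, so a generic kernelized linear attention and a simple pooled sigmoid gate are swapped in as stand-ins. The function names (`linear_attention`, `global_channel_gate`, `full_dimensional_block`), the feature map, the tensor layout `(T, H, W, C)`, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def linear_attention(q, k, v):
    """Generic kernelized linear attention (a stand-in, not FEAT's WKV attention).

    q, k, v have shape (..., N, D), where N is the number of tokens.
    Keys and values are first summarized into a (D, D) matrix, so the cost
    grows linearly with N instead of quadratically.
    """
    # Positive feature map elu(t) + 1 (an assumption made for this sketch).
    def phi(t):
        return np.where(t > 0, t + 1.0, np.exp(np.minimum(t, 0.0)))

    q, k = phi(q), phi(k)
    kv = np.einsum("...nd,...ne->...de", k, v)            # (..., D, D) key-value summary
    z = np.einsum("...nd,...d->...n", q, k.sum(axis=-2))  # per-query normalizer
    out = np.einsum("...nd,...de->...ne", q, kv)          # (..., N, D)
    return out / z[..., None]


def global_channel_gate(x):
    """Hypothetical channel attention: pool over every space-time position,
    then rescale each channel with a sigmoid gate. x has shape (T, H, W, C)."""
    pooled = x.mean(axis=(0, 1, 2))           # (C,) global channel descriptor
    gate = 1.0 / (1.0 + np.exp(-pooled))      # sigmoid
    return x * gate


def full_dimensional_block(x):
    """Apply spatial, then temporal, then channel attention, each with a residual.

    x has shape (T, H, W, C). No learned projections are used here, so the
    input itself serves as query, key, and value.
    """
    T, H, W, C = x.shape

    # Spatial attention: the tokens are the H*W positions inside each frame.
    s = x.reshape(T, H * W, C)
    x = x + linear_attention(s, s, s).reshape(T, H, W, C)

    # Temporal attention: the tokens are the T frames at each spatial position.
    t = x.reshape(T, H * W, C).transpose(1, 0, 2)         # (H*W, T, C)
    x = x + linear_attention(t, t, t).transpose(1, 0, 2).reshape(T, H, W, C)

    # Channel attention: one global gate per channel.
    x = x + global_channel_gate(x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.standard_normal((4, 8, 8, 16))   # 4 frames of 8x8 with 16 channels
    print(full_dimensional_block(video).shape)
```

Each of the three steps touches every token a constant number of times, which is the sense in which the design is linear in sequence length. A real model would add learned query, key, and value projections, normalization layers, and the timestep conditioning that the abstract describes.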
