Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models

Abstract

Transformers have catalyzed advances in computer vision and natural language processing (NLP). However, their substantial computational complexity limits their application to long-context tasks such as high-resolution image generation. This paper introduces a series of architectures adapted from the RWKV model used in NLP, with the modifications required for diffusion models applied to image generation tasks, referred to as Diffusion-RWKV. Similar to Transformer-based diffusion models, our model is designed to efficiently handle patchified inputs as a sequence with extra conditions, while also scaling up effectively to accommodate both large parameter counts and extensive datasets. Its distinctive advantage is its reduced spatial aggregation complexity, which makes it exceptionally adept at processing high-resolution images and eliminates the need for windowing or group cached operations. Experimental results on both conditional and unconditional image generation tasks demonstrate that Diffusion-RWKV performs on par with or better than existing CNN-based or Transformer-based diffusion models in FID and IS metrics while significantly reducing total FLOPs.
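
To make the scaling argument concrete, the following is a minimal sketch, not the paper's implementation. It patchifies a 2-D grid into a token sequence and compares a simplified token-interaction count for all-pairs attention (N × N) with that of a linear scan (N). It assumes that RWKV-style mixing scales linearly in sequence length, as RWKV does in NLP. The helper names (`num_tokens`, `patchify`, `mixing_cost`), the patch size of 16, and the resolutions are hypothetical choices for illustration; channel dimensions and constant factors are ignored.

```python
def num_tokens(height: int, width: int, patch: int) -> int:
    """Number of patch tokens for a height x width image with square patches."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch) * (width // patch)


def patchify(image, patch):
    """Split a 2-D grid (list of rows) into a row-major sequence of flattened patches."""
    h, w = len(image), len(image[0])
    expected = num_tokens(h, w, patch)  # also validates divisibility
    seq = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch row by row into a single token.
            seq.append([image[r][c]
                        for r in range(top, top + patch)
                        for c in range(left, left + patch)])
    assert len(seq) == expected
    return seq


def mixing_cost(n_tokens: int, quadratic: bool) -> int:
    """Simplified interaction count: N*N for all-pairs attention, N for a linear scan."""
    return n_tokens * n_tokens if quadratic else n_tokens


if __name__ == "__main__":
    # Hypothetical settings: square images, patch size 16.
    for side in (256, 512, 1024):
        n = num_tokens(side, side, 16)
        print(side, n, mixing_cost(n, quadratic=True), mixing_cost(n, quadratic=False))
```

Under these assumptions, doubling the image side quadruples the number of tokens, so the all-pairs count grows by a factor of 16 while the linear count grows by a factor of 4. This gap is the reason Transformer-based models often resort to windowing at high resolution.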

