Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Abstract

As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs fall into two categories: Diffusion-based and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that, after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally observe that Diffusion-based TTI models resemble the Prefill stage of LLM inference, and benefit from 1.1-2.5x greater speedup from Flash Attention than Transformer-based TTI models, which resemble the Decode stage. Since optimizations designed for LLMs do not map directly onto TTI/TTV models, we must conduct a thorough characterization of these workloads to gain insights for new optimization opportunities. In doing so, we define sequence length in the context of TTI/TTV models and observe that sequence length can vary by up to 4x in Diffusion model inference. We additionally observe that the temporal aspects of TTV workloads pose unique system bottlenecks, with Temporal Attention accounting for over 60% of total Attention time. Overall, our in-depth system performance characterization is a critical first step towards designing efficient and deployable systems for emerging TTI/TTV workloads.
