Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
December 22, 2023
Authors: Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
cs.AI
Abstract
As the development of large-scale Generative AI models evolves beyond text
(1D) generation to include image (2D) and video (3D) generation, processing
spatial and temporal information presents unique challenges to quality,
performance, and efficiency. We present the first work towards understanding
this new system design space for multi-modal text-to-image (TTI) and
text-to-video (TTV) generation models. Current model architecture designs are
bifurcated into 2 categories: Diffusion- and Transformer-based models. Our
systematic performance characterization on a suite of eight representative
TTI/TTV models shows that after state-of-the-art optimization techniques such
as Flash Attention are applied, Convolution accounts for up to 44% of execution
time for Diffusion-based TTI models, while Linear layers consume up to 49% of
execution time for Transformer-based models. We additionally observe that
Diffusion-based TTI models resemble the Prefill stage of LLM inference, and
benefit from 1.1-2.5x greater speedup from Flash Attention than
Transformer-based TTI models that resemble the Decode phase. Since
optimizations designed for LLMs do not map directly onto TTI/TTV models, we
must conduct a thorough characterization of these workloads to gain insights
for new optimization opportunities. In doing so, we define sequence length in
the context of TTI/TTV models and observe sequence length can vary up to 4x in
Diffusion model inference. We additionally observe temporal aspects of TTV
workloads pose unique system bottlenecks, with Temporal Attention accounting
for over 60% of total Attention time. Overall, our in-depth system performance
characterization is a critical first step towards designing efficient and
deployable systems for emerging TTI/TTV workloads.
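The operator-level breakdown reported above (Convolution vs. Linear vs. Attention time) can be approximated for any of these models with a standard operator profiler. The sketch below is illustrative only and is not the paper's measurement harness: the choice of pipeline checkpoint, the prompt, and the keyword-based grouping of kernel names into categories are all assumptions made for the example.

```python
# Illustrative only: approximate an operator-level time breakdown for a
# Diffusion-based TTI model with torch.profiler. Not the paper's harness;
# the checkpoint name and category keywords below are placeholder assumptions.
import torch
from torch.profiler import profile, ProfilerActivity
from diffusers import StableDiffusionPipeline  # hypothetical choice of TTI model

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    pipe("a photo of an astronaut riding a horse", num_inference_steps=20)

# Bucket self GPU time into coarse categories; keyword matching is a
# simplification of how kernels map onto model operators.
buckets = {"conv": 0.0, "linear": 0.0, "attention": 0.0, "other": 0.0}
for evt in prof.key_averages():
    name = evt.key.lower()
    t = evt.self_cuda_time_total  # microseconds
    if "conv" in name:
        buckets["conv"] += t
    elif any(k in name for k in ("linear", "addmm", "gemm", "mm")):
        buckets["linear"] += t
    elif "attention" in name or "sdpa" in name:
        buckets["attention"] += t
    else:
        buckets["other"] += t

total = sum(buckets.values()) or 1.0
for op, t in sorted(buckets.items(), key=lambda kv: -kv[1]):
    print(f"{op:>10s}: {100.0 * t / total:5.1f}% of self GPU time")
```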
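The "Temporal Attention" bottleneck arises because TTV models typically factorize attention into a spatial stage (the sequence is the image tokens within a frame) and a temporal stage (the sequence is the frames at a fixed spatial location). The following minimal single-head sketch shows that factorization under an assumed (batch, frames, tokens, dim) latent layout; the class name, projections, and shapes are illustrative and are not the paper's model code.

```python
# Minimal single-head sketch of factorized spatial vs. temporal attention as
# commonly used in TTV models. Assumes a (batch, frames, tokens, dim) latent
# layout; multi-head splitting and output projections are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv_spatial = nn.Linear(dim, 3 * dim)
        self.to_qkv_temporal = nn.Linear(dim, 3 * dim)

    def spatial(self, x: torch.Tensor) -> torch.Tensor:
        # Attend over spatial tokens within each frame: sequence length = tokens.
        b, f, t, d = x.shape
        q, k, v = self.to_qkv_spatial(x.reshape(b * f, t, d)).chunk(3, dim=-1)
        out = F.scaled_dot_product_attention(q, k, v)  # may use a Flash-Attention kernel on GPU
        return out.reshape(b, f, t, d)

    def temporal(self, x: torch.Tensor) -> torch.Tensor:
        # Attend over frames at each fixed spatial location: sequence length = frames.
        b, f, t, d = x.shape
        x_t = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        q, k, v = self.to_qkv_temporal(x_t).chunk(3, dim=-1)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.reshape(b, t, f, d).permute(0, 2, 1, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))

# Example: 16 frames of a 32x32 latent grid (1024 spatial tokens), 320 channels.
x = torch.randn(1, 16, 32 * 32, 320)
y = FactorizedAttention(320)(x)
print(y.shape)  # torch.Size([1, 16, 1024, 320])
```

Note that the effective sequence length differs between the two stages (spatial tokens vs. number of frames), which is why the temporal component can exhibit system behavior distinct from spatial attention, as characterized in the paper.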