Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
December 22, 2023
Authors: Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
cs.AI
Abstract
As the development of large-scale Generative AI models evolves beyond text
(1D) generation to include image (2D) and video (3D) generation, processing
spatial and temporal information presents unique challenges to quality,
performance, and efficiency. We present the first work towards understanding
this new system design space for multi-modal text-to-image (TTI) and
text-to-video (TTV) generation models. Current model architecture designs are
bifurcated into 2 categories: Diffusion- and Transformer-based models. Our
systematic performance characterization on a suite of eight representative
TTI/TTV models shows that after state-of-the-art optimization techniques such
as Flash Attention are applied, Convolution accounts for up to 44% of execution
time for Diffusion-based TTI models, while Linear layers consume up to 49% of
execution time for Transformer-based models. We additionally observe that
Diffusion-based TTI models resemble the Prefill stage of LLM inference, and
benefit from 1.1-2.5x greater speedup from Flash Attention than
Transformer-based TTI models that resemble the Decode phase. Since
optimizations designed for LLMs do not map directly onto TTI/TTV models, we
must conduct a thorough characterization of these workloads to gain insights
for new optimization opportunities. In doing so, we define sequence length in
the context of TTI/TTV models and observe sequence length can vary up to 4x in
Diffusion model inference. We additionally observe temporal aspects of TTV
workloads pose unique system bottlenecks, with Temporal Attention accounting
for over 60% of total Attention time. Overall, our in-depth system performance
characterization is a critical first step towards designing efficient and
deployable systems for emerging TTI/TTV workloads.
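The operator-level breakdown reported above (Convolution vs. Linear vs. Attention time) can be approximated for any of these models with a standard operator profiler. The sketch below is illustrative only and is not the paper's measurement harness: the choice of pipeline checkpoint, the prompt, and the keyword-based grouping of kernel names into categories are all assumptions made for the example.

```python
# Illustrative only: approximate an operator-level time breakdown for a
# Diffusion-based TTI model with torch.profiler. Not the paper's harness;
# the checkpoint name and category keywords below are placeholder assumptions.
import torch
from torch.profiler import profile, ProfilerActivity
from diffusers import StableDiffusionPipeline  # hypothetical choice of TTI model

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    pipe("a photo of an astronaut riding a horse", num_inference_steps=20)

# Bucket self GPU time into coarse categories; keyword matching is a
# simplification of how kernels map onto model operators.
buckets = {"conv": 0.0, "linear": 0.0, "attention": 0.0, "other": 0.0}
for evt in prof.key_averages():
    name = evt.key.lower()
    t = evt.self_cuda_time_total  # microseconds
    if "conv" in name:
        buckets["conv"] += t
    elif any(k in name for k in ("linear", "addmm", "gemm", "mm")):
        buckets["linear"] += t
    elif "attention" in name or "sdpa" in name:
        buckets["attention"] += t
    else:
        buckets["other"] += t

total = sum(buckets.values()) or 1.0
for op, t in sorted(buckets.items(), key=lambda kv: -kv[1]):
    print(f"{op:>10s}: {100.0 * t / total:5.1f}% of self GPU time")
```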
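The "Temporal Attention" bottleneck arises because TTV models typically factorize attention into a spatial stage (the sequence is the image tokens within a frame) and a temporal stage (the sequence is the frames at a fixed spatial location). The following minimal single-head sketch shows that factorization under an assumed (batch, frames, tokens, dim) latent layout; the class name, projections, and shapes are illustrative and are not the paper's model code.

```python
# Minimal single-head sketch of factorized spatial vs. temporal attention as
# commonly used in TTV models. Assumes a (batch, frames, tokens, dim) latent
# layout; multi-head splitting and output projections are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv_spatial = nn.Linear(dim, 3 * dim)
        self.to_qkv_temporal = nn.Linear(dim, 3 * dim)

    def spatial(self, x: torch.Tensor) -> torch.Tensor:
        # Attend over spatial tokens within each frame: sequence length = tokens.
        b, f, t, d = x.shape
        q, k, v = self.to_qkv_spatial(x.reshape(b * f, t, d)).chunk(3, dim=-1)
        out = F.scaled_dot_product_attention(q, k, v)  # may use a Flash-Attention kernel on GPU
        return out.reshape(b, f, t, d)

    def temporal(self, x: torch.Tensor) -> torch.Tensor:
        # Attend over frames at each fixed spatial location: sequence length = frames.
        b, f, t, d = x.shape
        x_t = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        q, k, v = self.to_qkv_temporal(x_t).chunk(3, dim=-1)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.reshape(b, t, f, d).permute(0, 2, 1, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))

# Example: 16 frames of a 32x32 latent grid (1024 spatial tokens), 320 channels.
x = torch.randn(1, 16, 32 * 32, 320)
y = FactorizedAttention(320)(x)
print(y.shape)  # torch.Size([1, 16, 1024, 320])
```

Note that the effective sequence length differs between the two stages (spatial tokens vs. number of frames), which is why the temporal component can exhibit system behavior distinct from spatial attention, as characterized in the paper.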