Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
December 22, 2023
Authors: Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
cs.AI
Abstract
As large-scale Generative AI models evolve beyond text
(1D) generation to include image (2D) and video (3D) generation, processing
spatial and temporal information presents unique challenges to quality,
performance, and efficiency. We present the first work towards understanding
this new system design space for multi-modal text-to-image (TTI) and
text-to-video (TTV) generation models. Current model architecture designs are
bifurcated into two categories: Diffusion- and Transformer-based models. Our
systematic performance characterization on a suite of eight representative
TTI/TTV models shows that after state-of-the-art optimization techniques such
as Flash Attention are applied, Convolution accounts for up to 44% of execution
time for Diffusion-based TTI models, while Linear layers consume up to 49% of
execution time for Transformer-based models. We additionally observe that
Diffusion-based TTI models resemble the Prefill stage of LLM inference, and
benefit from 1.1-2.5x greater speedup from Flash Attention than
Transformer-based TTI models that resemble the Decode phase. Since
optimizations designed for LLMs do not map directly onto TTI/TTV models, we
must conduct a thorough characterization of these workloads to gain insights
for new optimization opportunities. In doing so, we define sequence length in
the context of TTI/TTV models and observe that it can vary by up to 4x during
Diffusion model inference. We additionally observe that the temporal aspects of
TTV workloads pose unique system bottlenecks, with Temporal Attention accounting
for over 60% of total Attention time. Overall, our in-depth system performance
characterization is a critical first step towards designing efficient and
deployable systems for emerging TTI/TTV workloads.
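The operator-level breakdown quoted in the abstract (Convolution up to 44% of execution time for Diffusion-based TTI models, Linear up to 49% for Transformer-based ones) is the kind of result an operator-time profile yields. Below is a minimal sketch of how such a per-operator share can be collected with `torch.profiler`; the tiny conv/linear model is a hypothetical stand-in, not the paper's eight-model suite.

```python
# Minimal sketch: per-operator time shares via torch.profiler.
# The toy model below is a stand-in for a real TTI model.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(
    nn.Conv2d(4, 64, kernel_size=3, padding=1),  # Diffusion UNets are conv-heavy
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 512),                # Transformer blocks are linear-heavy
).eval()

x = torch.randn(1, 4, 32, 32)
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Aggregating self time by operator name gives shares analogous to the
# paper's "Convolution accounts for up to 44% of execution time" finding.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```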
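The Prefill/Decode analogy can be made concrete through attention shapes. The sketch below contrasts a diffusion-style full-sequence attention call (query length equals key length, where FlashAttention-style kernels help most) with a decode-style single-query call. Sizes are illustrative assumptions, and `scaled_dot_product_attention` is used as a generic stand-in for an optimized attention kernel.

```python
# Sketch of the attention shapes behind the Prefill/Decode analogy.
import torch
import torch.nn.functional as F

B, H, D = 1, 8, 64
seq = 4096                                   # latent image tokens (assumed size)

# "Prefill-like" diffusion attention: every denoising step attends over
# the full latent sequence, (B, H, seq, D) against (B, H, seq, D).
q = k = v = torch.randn(B, H, seq, D)
out_prefill = F.scaled_dot_product_attention(q, k, v)

# "Decode-like" autoregressive TTI step: one new query token against
# previously generated keys/values, as in LLM decoding.
q1 = torch.randn(B, H, 1, D)
out_decode = F.scaled_dot_product_attention(q1, k, v)

print(out_prefill.shape, out_decode.shape)   # (1, 8, 4096, 64) (1, 8, 1, 64)
```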
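One way to read the "sequence length can vary by up to 4x" observation: in a diffusion UNet, the attention sequence length is the number of flattened latent pixels, so each 2x spatial downsample shrinks it by 4x. A small illustrative calculation, with assumed latent dimensions:

```python
# Sketch: diffusion "sequence length" as flattened latent pixels (H*W).
# Latent sizes here are assumptions (e.g., a 512x512 image with an 8x VAE).
latent_h = latent_w = 64
for stage in range(3):
    h, w = latent_h >> stage, latent_w >> stage
    print(f"stage {stage}: attention sequence length = {h * w}")
# stage 0: 4096, stage 1: 1024, stage 2: 256 -> a 4x change per downsample
```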
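The Temporal Attention bottleneck arises from the factorized attention commonly used in TTV models: spatial attention mixes tokens within a frame (sequence length H*W), while temporal attention mixes the same pixel position across frames (sequence length T). A minimal sketch of the two reshapes, with assumed shapes; this illustrates the general pattern rather than any specific model in the paper's suite.

```python
# Sketch of spatial vs. temporal attention factorization in TTV models.
import torch
import torch.nn.functional as F

B, T, HW, C = 1, 16, 1024, 64                # frames, latent pixels, channels
x = torch.randn(B, T, HW, C)

# Spatial: batch over frames, attend across the H*W tokens of each frame.
xs = x.reshape(B * T, HW, C)
spatial = F.scaled_dot_product_attention(xs, xs, xs)

# Temporal: batch over pixel positions, attend across the T frames. Despite
# the short sequence, the paper finds this dominates Attention time (>60%).
xt = x.permute(0, 2, 1, 3).reshape(B * HW, T, C)
temporal = F.scaled_dot_product_attention(xt, xt, xt)

print(spatial.shape, temporal.shape)         # (16, 1024, 64) (1024, 16, 64)
```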