KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
December 29, 2025
Authors: Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu
cs.AI
Abstract
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, doing so presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework that tackles heterogeneity at scale for DLRM. KernelEvolve takes kernel specifications as input and automates the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. It does so by operating at multiple programming abstractions, from the Triton and CuTe DSLs to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search with a selection policy, universal operators, a fitness function, and a termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and 100% correctness on 160 PyTorch ATen operators across three heterogeneous hardware platforms. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly lowers the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
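The graph-based search the abstract describes (candidates as nodes, a selection policy, operators that produce new candidates, a fitness function, and a termination rule) can be sketched abstractly. This is a minimal illustrative loop, not KernelEvolve's actual implementation; all names (`graph_search`, `mutate`, `fitness`) are hypothetical placeholders, and in the real system the operator would be an LLM call driven by retrieval-augmented prompts and the fitness function a measured speedup over the PyTorch baseline.

```python
import random

def graph_search(seed_kernel, mutate, fitness, max_iters=100, target=3):
    """Illustrative sketch of a graph-based kernel search.

    Nodes are (candidate, score) pairs; applying `mutate` to a selected
    parent adds an edge to a new candidate node. All details here are
    assumptions for illustration, not KernelEvolve's API.
    """
    nodes = [(seed_kernel, fitness(seed_kernel))]
    for _ in range(max_iters):
        # Selection policy: tournament-style, biased toward higher fitness.
        parent = max(random.sample(nodes, min(3, len(nodes))),
                     key=lambda n: n[1])
        # Universal operator: rewrite the parent into a new candidate.
        child = mutate(parent[0])
        score = fitness(child)  # e.g. speedup over a reference kernel
        nodes.append((child, score))
        # Termination rule: stop once a candidate reaches the target.
        if score >= target:
            break
    return max(nodes, key=lambda n: n[1])
```

With a toy `mutate` that increments an integer and `fitness` as the identity, the loop walks the search graph until the target score is reached, mirroring how the framework iterates until a kernel meets its performance criterion.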