FastKernels: Benchmarking GPU Kernel Generation in Production
May 22, 2026
Authors: Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari
cs.AI
Abstract
LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only a 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels
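To make the two headline figures concrete, the following is a minimal sketch of the arithmetic behind them. The coverage ratio 409/425 comes directly from the abstract. The abstract does not say how the aggregate speedup is computed, so the geometric mean used here is an assumption, and the function names and per-task speedups are hypothetical values chosen only for illustration.

```python
import math


def coverage_percent(covered: int, total: int) -> float:
    """Share of architectures covered, as a percentage."""
    return 100.0 * covered / total


def geometric_mean_speedup(speedups: list[float]) -> float:
    """Aggregate per-task speedups with a geometric mean.

    ASSUMPTION: the abstract does not specify the aggregation method;
    the geometric mean is a common choice for ratios, not a confirmed
    detail of FastKernels.
    """
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive values")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Coverage figure from the abstract: 409 of 425 architectures.
print(f"coverage: {coverage_percent(409, 425):.1f}%")

# HYPOTHETICAL per-task speedups (agent kernel vs. production baseline).
# Values below 1.0 mean the generated kernel is slower than the baseline.
hypothetical_speedups = [1.25, 0.80, 1.00]
print(f"aggregate: {geometric_mean_speedup(hypothetical_speedups):.2f}x")
```

Under this assumed aggregation, a gain on one task is cancelled by a proportional regression on another, which is one way an agent that wins on some kernels can still land below 1× overall.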