FastKernels: Benchmarking GPU Kernel Generation in Production

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures. Each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only a 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels