FastKernels: Benchmarking GPU Kernel Generation in Production

Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a lightweight, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream reference implementations on under-served architectures. Each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, so optimized kernels can be deployed directly into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only a 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×; all three fall below 1×, meaning they are slower than the baselines. This confirms that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels