Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
March 30, 2026
Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
cs.AI
Abstract
We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable, evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
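The abstract's agent-side loop can be pictured as a small sketch: a population of executable candidates, an archive of the best-scoring ones, and a revision step that only keeps correctness-preserving improvements. All names here (`Candidate`, `evaluate`, `propose_revision`, the toy fitness) are illustrative assumptions, not the paper's actual API; the real system would call an LLM for revisions and a backend-specific Triton/Maca evaluation service for feedback.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    """One executable kernel candidate with its execution feedback."""
    code: str
    compiled: bool = False
    correct: bool = False
    speedup: float = 0.0

def evaluate(cand: Candidate) -> Candidate:
    # Stand-in for the backend-specific evaluation service, which reports
    # compilation status, correctness, and speedup. Here a toy fitness
    # based on code length replaces a real benchmark run.
    cand.compiled = True
    cand.correct = True
    cand.speedup = 1.0 + (len(cand.code) % 7) / 10
    return cand

def propose_revision(parent: Candidate, rng: random.Random) -> Candidate:
    # Stand-in for the model acting as a local improver: revise one
    # parent drawn from the archive rather than generating from scratch.
    return Candidate(code=parent.code + rng.choice("abc"))

def evolve(seed_code: str, steps: int = 20, archive_size: int = 4) -> Candidate:
    rng = random.Random(0)
    archive = [evaluate(Candidate(seed_code))]
    for _ in range(steps):
        parent = max(archive, key=lambda c: c.speedup)
        child = evaluate(propose_revision(parent, rng))
        # Retain only correctness-preserving revisions, mirroring the
        # step-centric filtering used to build training signals.
        if child.correct:
            archive.append(child)
            archive.sort(key=lambda c: c.speedup, reverse=True)
            archive = archive[:archive_size]
    return archive[0]

best = evolve("kernel_v0")
print(best.correct, round(best.speedup, 2))
```

In the real framework the archive also preserves *diverse* programs, not just the fastest, so the improver sees varied parents; this sketch keeps only the top-k for brevity.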