Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

Abstract

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable, evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs, together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining only correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting the potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
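The two mechanisms described above, an archive of top-performing and diverse candidates and a filter that keeps only correctness-preserving, high-gain revisions, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `Feedback`, `Step`, `Archive`, `filter_trajectory`, the per-bucket diversity key, and the `min_gain` threshold are all hypothetical, and keeping one best candidate per bucket is only one simple way to get diversity; the abstract does not say how the archive is organized.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Feedback:
    """Structured execution feedback for one candidate (hypothetical schema)."""
    compiled: bool
    correct: bool
    speedup: float  # baseline time / candidate time


@dataclass(frozen=True)
class Step:
    """One revision in an evolution trajectory: parent speedup and child feedback."""
    parent_speedup: float
    child: Feedback


class Archive:
    """Keeps the fastest valid candidate per diversity bucket (an assumed scheme)."""

    def __init__(self) -> None:
        self._best: Dict[str, Tuple[str, Feedback]] = {}

    def add(self, bucket: str, code: str, fb: Feedback) -> bool:
        # Candidates that fail compilation or correctness never enter the archive.
        if not (fb.compiled and fb.correct):
            return False
        current = self._best.get(bucket)
        if current is None or fb.speedup > current[1].speedup:
            self._best[bucket] = (code, fb)
            return True
        return False

    def top(self, k: int) -> List[Tuple[str, Feedback]]:
        # Highest-speedup entries across all buckets, used as parents or context.
        ranked = sorted(self._best.values(), key=lambda e: e[1].speedup, reverse=True)
        return ranked[:k]


def filter_trajectory(steps: List[Step], min_gain: float = 1.1) -> List[Step]:
    """Retain revisions that stay correct and beat the parent by at least min_gain x."""
    kept = []
    for s in steps:
        valid = s.child.compiled and s.child.correct
        if valid and s.child.speedup >= s.parent_speedup * min_gain:
            kept.append(s)
    return kept
```

In this sketch, the steps returned by `filter_trajectory` would serve as step-level training examples, while `Archive.top` would supply the parents for the next round of revisions. Both roles are assumptions made for illustration.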
