K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model
February 22, 2026
作者: Shiyi Cao, Ziming Mao, Joseph E. Gonzalez, Ion Stoica
cs.AI
Abstract
Optimizing GPU kernels is critical for efficient modern machine learning systems, yet it remains challenging due to the complex interplay of design factors and rapid hardware evolution. Existing automated approaches typically treat Large Language Models (LLMs) merely as stochastic code generators within heuristic-guided evolutionary loops. These methods often struggle with complex kernels requiring coordinated, multi-step structural transformations, as they lack explicit planning capabilities and frequently discard promising strategies due to inefficient or incorrect intermediate implementations. To address this, we propose Search via Co-Evolving World Model and build K-Search on this method. By replacing static search heuristics with a co-evolving world model, our framework leverages LLMs' prior domain knowledge to guide the search, actively exploring the optimization space. This approach explicitly decouples high-level algorithmic planning from low-level program instantiation, enabling the system to navigate non-monotonic optimization paths while remaining resilient to temporary implementation defects. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels. Our results show that K-Search significantly outperforms state-of-the-art evolutionary search methods, achieving an average 2.10x improvement and up to a 14.3x gain on complex MoE kernels. On the GPUMode TriMul task, K-Search achieves state-of-the-art performance on H100, reaching 1030 μs and surpassing both prior evolutionary and human-designed solutions.